Abstract
In this paper we describe a real-time speech-driven method for synthesising realistic video sequences of a subject enunciating arbitrary phrases. In an offline training phase, an active appearance model (AAM) is constructed from hand-labelled images and used to encode the face of a subject reciting a few training sentences. Canonical correlation analysis (CCA) coupled with linear regression is then used to model the relationship between the auditory and visual features; this model is later used to predict visual features from the auditory features of novel utterances.
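As a rough illustration of this mapping step, the sketch below pairs scikit-learn's CCA with ordinary linear regression to predict visual (AAM) parameters from auditory features. The feature dimensions, component count, and random data are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LinearRegression

# Placeholder dimensions: one row per video frame, auditory features
# (e.g. spectral coefficients) on one side, AAM parameters on the other.
n_frames, n_audio, n_visual = 500, 13, 20
rng = np.random.default_rng(0)
A_train = rng.standard_normal((n_frames, n_audio))   # auditory features
V_train = rng.standard_normal((n_frames, n_visual))  # visual (AAM) features

# CCA finds paired projections of the two feature sets that are
# maximally correlated.
cca = CCA(n_components=8)
A_c, _ = cca.fit_transform(A_train, V_train)

# Linear regression then maps the canonical auditory variates onto the
# visual (AAM) parameters.
reg = LinearRegression().fit(A_c, V_train)

# For a novel utterance: project its auditory features with the learned
# CCA transform, then regress to obtain predicted visual parameters.
A_new = rng.standard_normal((100, n_audio))
V_pred = reg.predict(cca.transform(A_new))
print(V_pred.shape)  # (100, 20)
```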
We present results from experiments conducted to determine: 1) the suitability of several auditory features for use in an AAM-based speech-driven talking head; 2) the effect of the size of the training set on the correlation between the auditory and visual features; 3) the influence of context on the degree of correlation; and 4) the appropriate window size over which the auditory features should be calculated. This approach shows promise, and a longer-term goal is to develop a fully expressive, three-dimensional talking head.
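The window-size experiment can be pictured with a short sketch that extracts one candidate auditory feature over a range of analysis windows. MFCCs are used here purely as an illustration (the paper compares several auditory features); the file name, hop size, and candidate window sizes are assumptions.

```python
import librosa

# Illustrative only: MFCCs stand in for "an auditory feature".
def auditory_features(wav_path, win_ms, hop_ms=40.0, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(sr * win_ms / 1000.0)   # analysis window in samples
    hop = int(sr * hop_ms / 1000.0)     # hop chosen to match the video frame rate
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop).T  # (frames, n_mfcc)

for win_ms in (20, 40, 80, 160):
    feats = auditory_features("utterance.wav", win_ms)
    print(win_ms, "ms window ->", feats.shape)
```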
| Original language | English |
| --- | --- |
| Publication status | Published - 2007 |
| Event | International Conference on Audio-Visual Speech Processing (AVSP), Kasteel Groenendaal, Hilvarenbeek, Netherlands |
| Duration | 31 Aug 2007 → 3 Sep 2007 |
Conference

| Conference | International Conference on Audio-Visual Speech Processing (AVSP) |
| --- | --- |
| Country/Territory | Netherlands |
| City | Hilvarenbeek |
| Period | 31/08/07 → 3/09/07 |