In this paper we present initial work towards a video-realistic visual speech synthesiser based on statistical models of shape and appearance. A synthesised image sequence corresponding to an utterance is formed by concatenating synthesis units (in this case phonemes) drawn from a pre-recorded corpus of training data. A smoothing spline is applied to the concatenated model parameters to ensure smooth transitions between frames, and the smoothed parameters are then applied to the model. Early results look promising.
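The pipeline the abstract describes, concatenating per-phoneme parameter trajectories and then smoothing across the unit boundaries, can be sketched as below. This is a minimal illustration, not the authors' implementation: the unit data is hypothetical, and a simple moving-average smoother stands in for the smoothing spline used in the paper.

```python
def concatenate_units(units):
    """Concatenate per-phoneme parameter trajectories (each a list of frames)."""
    frames = []
    for unit in units:
        frames.extend(unit)
    return frames

def smooth(frames, window=3):
    """Moving-average smoother over each parameter dimension.

    A stand-in for the smoothing spline described in the abstract; it
    likewise damps discontinuities at the joins between concatenated units.
    """
    n, dims = len(frames), len(frames[0])
    half = window // 2
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out.append([sum(f[d] for f in frames[lo:hi]) / (hi - lo)
                    for d in range(dims)])
    return out

# Two hypothetical phoneme units, each a short sequence of 2-D model
# parameter vectors (real appearance-model parameters would be much longer).
unit_a = [[0.0, 1.0], [0.2, 0.8]]
unit_b = [[1.0, 0.0], [0.9, 0.1]]
trajectory = smooth(concatenate_units([unit_a, unit_b]))
```

The smoothed trajectory would then drive the shape-and-appearance model frame by frame to render the synthesised image sequence.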
|Number of pages||4|
|Publication status||Published - May 2002|
|Event||IEEE International Conference on Acoustics, Speech and Signal Processing - Orlando, United States|
|Duration||13 May 2002 → 17 May 2002|
|Conference||IEEE International Conference on Acoustics, Speech and Signal Processing|
|Abbreviated title||ICASSP 2002|
|Period||13/05/02 → 17/05/02|