In this paper we present preliminary results of work towards a video-realistic visual speech synthesizer based on statistical models of shape and appearance. A sequence of images corresponding to an utterance is formed by concatenation of synthesis units (in this case triphones) from a pre-recorded inventory. Initial work has concentrated on a compact representation of human faces, accommodating an extensive visual speech corpus without incurring excessive storage costs. The minimal set of control parameters of a combined appearance model is selected according to formal subjective testing. We also present two methods used to build statistical models that account for the perceptually important regions of the face.
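The compact representation described above can be illustrated with a small sketch. The paper does not give implementation details, so everything below is an assumption: it shows the generic idea of reducing flattened face images to a minimal set of control parameters with PCA, using synthetic data and an arbitrary parameter count in place of the authors' subjectively selected one.

```python
import numpy as np

# Illustrative sketch only: a combined appearance model compresses each
# face image (flattened to a vector) into a few control parameters via
# PCA. The corpus here is synthetic random data, and k is arbitrary.
rng = np.random.default_rng(0)
n_frames, n_pixels = 200, 1024      # assumed corpus size / image size
X = rng.standard_normal((n_frames, n_pixels))

mean = X.mean(axis=0)
Xc = X - mean

# SVD of the centred data yields the principal modes of variation;
# keeping only the k strongest gives the compact parameterisation.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 20                               # hypothetical minimal parameter count
P = Vt[:k]                           # k basis vectors ("modes")

# Encode one frame as k parameters, then reconstruct an approximation.
b = P @ (X[0] - mean)                # control parameters for frame 0
x_hat = mean + P.T @ b               # approximate reconstruction

print(f"{n_pixels} pixels -> {k} parameters per frame")
```

Storing only the k-dimensional parameter vectors (plus the shared mean and basis) is what lets an extensive corpus fit without excessive storage cost; the paper's subjective testing would then determine how small k can be before degradation becomes perceptible.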
|Number of pages||6|
|Publication status||Published - Sep 2001|
|Event||International Conference on Auditory-Visual Speech Processing (AVSP-2001) - Aalborg, Denmark|
Duration: 7 Sep 2001 → 9 Sep 2001