Abstract
In this paper we present preliminary results of work towards a video-realistic visual speech synthesizer based on statistical models of shape and appearance. A sequence of images corresponding to an utterance is formed by concatenation of synthesis units (in this case triphones) from a pre-recorded inventory. Initial work has concentrated on a compact representation of human faces, accommodating an extensive visual speech corpus without incurring excessive storage costs. The minimal set of control parameters of a combined appearance model is selected according to formal subjective testing. We also present two methods used to build statistical models that account for the perceptually important regions of the face.
| Original language | English |
|---|---|
| Pages | 78-83 |
| Number of pages | 6 |
| Publication status | Published - Sept 2001 |
| Event | International Conference on Auditory-Visual Speech Processing (AVSP-2001) - Aalborg, Denmark. Duration: 7 Sept 2001 → 9 Sept 2001 |
Conference
| Conference | International Conference on Auditory-Visual Speech Processing (AVSP-2001) |
|---|---|
| Country/Territory | Denmark |
| City | Aalborg |
| Period | 7/09/01 → 9/09/01 |