Abstract
Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is “a set of phonemes which have an identical appearance on the lips”. A phoneme therefore falls into exactly one viseme class, but a viseme may represent many phonemes: a many-to-one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings.
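To make the many-to-one relationship concrete, here is a minimal sketch of a phoneme-to-viseme lookup and its inversion. The viseme labels and phoneme groupings below are illustrative assumptions (bilabials and labiodentals are classic examples of visually confusable groups), not the mappings evaluated in the paper.

```python
# Hypothetical phoneme-to-viseme map (illustrative only, not the
# paper's mapping): several phonemes share one visual class.
PHONEME_TO_VISEME = {
    # bilabials look identical on the lips
    "p": "V_bilabial",
    "b": "V_bilabial",
    "m": "V_bilabial",
    # labiodentals form another visually similar group
    "f": "V_labiodental",
    "v": "V_labiodental",
}

def invert_mapping(mapping):
    """Invert the many-to-one map. Each viseme covers several phonemes,
    which is the source of ambiguity for viseme classifiers."""
    inverse = {}
    for phoneme, viseme in mapping.items():
        inverse.setdefault(viseme, []).append(phoneme)
    return inverse

print(invert_mapping(PHONEME_TO_VISEME))
# {'V_bilabial': ['p', 'b', 'm'], 'V_labiodental': ['f', 'v']}
```

A classifier that outputs `V_bilabial` cannot, by itself, distinguish /p/, /b/ and /m/; recovering the phoneme requires extra context, which is why the choice of mapping matters.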
In this paper we explore the choice of phoneme-to-viseme map. We show that there is a definite difference in performance between phoneme-to-viseme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, ‘Bear’ visemes, are shown to outperform previously known units.
| Original language | English |
| --- | --- |
| Pages (from-to) | 40-67 |
| Number of pages | 28 |
| Journal | Speech Communication |
| Volume | 95 |
| Early online date | 29 Jul 2017 |
| DOIs | |
| Publication status | Published - Dec 2017 |
Keywords
- Lipreading
- Speaker-dependent
- Viseme
- Phoneme
- Resolution
- Speech recognition
- Classification
- Visual speech
- Visual units
Profiles
- Richard Harvey (Professor, School of Computing Sciences): Research Group Member, Academic, Teaching & Research