Abstract
Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium-size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error rates for speaker-independent lip-reading can be reduced very significantly. Furthermore, we show that error rates can be reduced further by the additional use of deep neural networks (DNNs). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
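The abstract does not name the speech-recognition technique that is adapted to lip-reading, so the following is only an illustrative sketch of one well-established candidate from acoustic speech recognition: a per-speaker affine transform of the input features, in the spirit of CMLLR/fMLLR adaptation. All function names, feature dimensions, and data in this sketch are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only (assumed technique, hypothetical names and shapes):
# estimate a per-speaker affine transform W, b that maps a new speaker's visual
# features towards a speaker-independent reference, then apply it before decoding.
import numpy as np

def estimate_affine_adaptation(speaker_feats, target_feats):
    """Estimate W, b so that W @ x + b maps this speaker's visual features
    (e.g. tracked lip-shape/appearance vectors) towards frame-aligned
    speaker-independent reference features. Alignment is outside this sketch.

    speaker_feats, target_feats: arrays of shape (n_frames, dim).
    """
    X = np.hstack([speaker_feats, np.ones((len(speaker_feats), 1))])  # append bias column
    # Least-squares solution of X @ params ≈ target_feats
    params, *_ = np.linalg.lstsq(X, target_feats, rcond=None)
    W, b = params[:-1].T, params[-1]
    return W, b

def adapt(feats, W, b):
    """Apply the speaker-specific transform to a block of feature frames."""
    return feats @ W.T + b

# Toy usage with random stand-in data (real features would come from tracking).
rng = np.random.default_rng(0)
dim = 44                                   # hypothetical visual feature dimension
ref = rng.normal(size=(500, dim))          # speaker-independent reference frames
spk = 0.8 * ref + 0.3 + 0.05 * rng.normal(size=ref.shape)  # mismatched "new" speaker
W, b = estimate_affine_adaptation(spk, ref)
adapted = adapt(spk, W, b)
print(np.mean((spk - ref) ** 2), np.mean((adapted - ref) ** 2))  # mismatch shrinks
```

In a DNN-based system of the kind mentioned in the abstract, such adapted (or otherwise speaker-normalised) features would simply be fed to the network in place of the raw visual features; the sketch above makes no claim about how the paper itself combines adaptation with the DNN.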
| Original language | English |
|---|---|
| Title of host publication | 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| Publisher | The Institute of Electrical and Electronics Engineers (IEEE) |
| Pages | 2722-2726 |
| Number of pages | 5 |
| DOIs | |
| Publication status | Published - 19 May 2016 |
| Event | 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 20 Mar 2016 → 25 Mar 2016 |
Conference

| Conference | 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
|---|---|
| Period | 20/03/16 → 25/03/16 |
Profiles

- Richard Harvey, School of Computing Sciences, Professor
  Person: Research Group Member, Academic, Teaching & Research