Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium-size vocabulary (around 1000 words) is now realistic. However, recognising previously unseen speakers remains very challenging, because of the large variation in lip shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error rates for speaker-independent lip-reading can be significantly reduced. Furthermore, we show that error rates can be reduced still further by the additional use of Deep Neural Networks (DNNs). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
|Title of host publication||2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|
|Number of pages||5|
|Publication status||Published - 19 May 2016|
|Event||2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|
|Duration||20 Mar 2016 → 25 Mar 2016|