Improved speaker independent lip reading using speaker adaptive training and deep neural networks

Ibrahim Almajai, Stephen Cox, Richard Harvey, Yuxuan Lan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium-sized vocabulary (around 1000 words) is realistic. However, recognising previously unseen speakers has proved very challenging, because of the large variation in lip shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error rates can be reduced even further by the additional use of deep neural networks (DNNs). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
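To make the adaptation idea concrete, below is a minimal sketch of feature-space speaker adaptation in Python with NumPy. It uses a simplified diagonal affine transform (a per-dimension scale and shift) that maps each speaker's feature statistics onto the pooled, speaker-independent statistics. This is only a stand-in for the full fMLLR/CMLLR-style transforms used in speaker adaptive training (SAT), which are estimated under the model likelihood; the 40-dimensional features, data sizes, and function names are illustrative assumptions, not the authors' code.

import numpy as np

def estimate_speaker_transform(speaker_feats, global_mean, global_std):
    # Estimate a diagonal affine transform x' = a*x + b that maps one
    # speaker's feature statistics onto the pooled speaker-independent
    # statistics. A simplified, diagonal stand-in for the full
    # fMLLR/CMLLR transforms normally used in SAT.
    spk_mean = speaker_feats.mean(axis=0)
    spk_std = speaker_feats.std(axis=0) + 1e-8  # guard against zero variance
    a = global_std / spk_std                    # per-dimension scale
    b = global_mean - a * spk_mean              # per-dimension shift
    return a, b

def apply_transform(feats, a, b):
    # Apply the per-speaker affine transform to a (frames x dims) array.
    return feats * a + b

# Toy usage: two hypothetical speakers whose visual features differ in
# location and spread, mimicking lip-shape variation across speakers.
rng = np.random.default_rng(0)
spk1 = rng.normal(loc=2.0, scale=1.5, size=(500, 40))   # 40-dim features (assumed)
spk2 = rng.normal(loc=-1.0, scale=0.5, size=(500, 40))

pooled = np.vstack([spk1, spk2])
g_mean, g_std = pooled.mean(axis=0), pooled.std(axis=0)

# SAT-style loop: adapt each speaker's features towards the canonical space
# before training, so the model captures speech content rather than speaker
# identity.
for spk in (spk1, spk2):
    a, b = estimate_speaker_transform(spk, g_mean, g_std)
    adapted = apply_transform(spk, a, b)

In the full SAT procedure, the per-speaker transforms and the canonical model are re-estimated alternately during training; at test time, a transform for each unseen speaker is estimated from a small amount of that speaker's data before recognition.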
Original language: English
Title of host publication: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: The Institute of Electrical and Electronics Engineers (IEEE)
Pages: 2722-2726
Number of pages: 5
DOIs
Publication status: Published - 19 May 2016
Event: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Duration: 20 Mar 2016 - 25 Mar 2016

Conference

Conference: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Period: 20/03/16 - 25/03/16
