Improving Lip-reading Performance for Robust Audiovisual Speech Recognition using DNNs

Kwanchiva Thangthai, Richard Harvey, Stephen Cox, Barry-John Theobald

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

31 Citations (Scopus)


This paper presents preliminary experiments using the Kaldi toolkit to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular, we use a single-speaker, large-vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The models trained using the Kaldi toolkit are compared with models trained using conventional hidden Markov models (HMMs). In addition, we compare the performance of a speech recognizer both with and without visual features over nine different SNR levels of babble noise, ranging from 20dB down to -20dB. The results show that the DNN outperforms conventional HMMs in all experimental conditions, especially for the lip-reading-only system, which achieves a gain of 37.19% accuracy (84.67% absolute word accuracy). Moreover, the DNN provides an effective improvement of 10 and 12dB SNR respectively for the single-modal and bimodal speech recognition systems. However, integrating the visual features using simple feature fusion is only effective at SNRs of 5dB and above. Below this, the degradation in accuracy of an audiovisual system is similar to that of the audio-only recognizer.

Index Terms: lip-reading, speech reading, audiovisual speech recognition
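The "simple feature fusion" mentioned in the abstract usually denotes early (feature-level) fusion, in which the per-frame acoustic and visual feature vectors are concatenated before being fed to a single recognizer. The sketch below illustrates this idea with NumPy; the dimensionalities and the `early_fusion` helper are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np

def early_fusion(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Concatenate frame-aligned audio and visual feature vectors.

    audio_feats:  (T, Da) array, e.g. MFCC frames (assumption: 13-dim here)
    visual_feats: (T, Dv) array, e.g. lip-appearance features per frame
    Returns a (T, Da + Dv) fused feature matrix for a single recognizer.
    """
    if audio_feats.shape[0] != visual_feats.shape[0]:
        # the two streams must be interpolated/resampled to a common
        # frame rate before fusion; this sketch assumes that is done
        raise ValueError("streams must be frame-aligned before fusion")
    return np.hstack([audio_feats, visual_feats])

# toy example: 100 frames, 13-dim audio features, 20-dim visual features
audio = np.random.randn(100, 13)
visual = np.random.randn(100, 20)
fused = early_fusion(audio, visual)
print(fused.shape)  # (100, 33)
```

Because the fused vector is treated as a single observation stream, a noisy acoustic component can dominate the representation, which is consistent with the abstract's finding that this scheme stops helping below 5dB SNR.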
Original language: English
Title of host publication: FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing
Number of pages: 5
Publication status: Published - Sep 2015
Event: FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing - Vienna, Austria
Duration: 11 Sep 2015 - 13 Sep 2015


Conference: FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing
Abbreviated title: FAAVSP 2015