Improving Lip-reading Performance for Robust Audiovisual Speech Recognition using DNNs

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents preliminary experiments using the Kaldi toolkit to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular, we use a single-speaker, large-vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The performance of models trained using the Kaldi toolkit is compared with that of models trained using conventional hidden Markov models (HMMs). In addition, we compare the performance of a speech recognizer both with and without visual features over nine different SNR levels of babble noise, ranging from 20 dB down to -20 dB. The results show that the DNN outperforms conventional HMMs in all experimental conditions, especially for the lip-reading-only system, which achieves a gain of 37.19% accuracy (84.67% absolute word accuracy). Moreover, the DNN provides an effective SNR improvement of 10 dB and 12 dB for the single-modal and bimodal speech recognition systems, respectively. However, integrating the visual features using simple feature fusion is only effective at SNRs of 5 dB and above. Below this, the degradation in accuracy of the audiovisual system is similar to that of the audio-only recognizer.

Index Terms: lip-reading, speech reading, audiovisual speech recognition
Original language: English
Title of host publication: FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing
Pages: 127-131
Number of pages: 5
Publication status: Published - Sep 2015
Event: FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing - Vienna, Austria
Duration: 11 Sep 2015 to 13 Sep 2015
http://www.isca-speech.org/archive/avsp15/av15_127.html

Conference

Conference: FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing
Country: Austria
City: Vienna
Period: 11/09/15 to 13/09/15