Abstract
This paper presents preliminary experiments using the Kaldi toolkit to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular, we use a single-speaker, large-vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The models trained using the Kaldi toolkit are compared with models trained using conventional hidden Markov models (HMMs). In addition, we compare the performance of a speech recognizer with and without visual features over nine SNR levels of babble noise, ranging from 20 dB down to -20 dB. The results show that the DNN outperforms conventional HMMs in all experimental conditions, especially for the lip-reading-only system, which achieves a gain of 37.19% accuracy (84.67% absolute word accuracy). Moreover, the DNN provides an effective improvement of 10 dB and 12 dB SNR for the single-modal and bimodal speech recognition systems, respectively. However, integrating the visual features using simple feature fusion is only effective at SNRs of 5 dB and above. Below this, the degradation in accuracy of the audiovisual system is similar to that of the audio-only recognizer.

Index Terms: lip-reading, speech reading, audiovisual speech recognition
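As an illustration of two ideas the abstract relies on, the following is a minimal sketch of mixing noise into speech at a target SNR and of simple feature fusion. It is not the authors' pipeline. The abstract does not say how the babble noise was mixed, what spacing the nine SNR levels used, or how fusion was implemented, so the 5 dB step, the per-frame concatenation, and every function name here (`mix_at_snr`, `fuse_features`, `SNR_LEVELS_DB`) are assumptions made for illustration.

```python
import math

# Assumption: nine evenly spaced levels from 20 dB down to -20 dB (5 dB steps).
# The abstract gives only the count and the end points.
SNR_LEVELS_DB = list(range(20, -25, -5))


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def fuse_features(audio_frames, visual_frames):
    """Early fusion by concatenating per-frame audio and visual vectors.

    Assumes both streams were already resampled to the same frame rate.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("streams must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]
```

In a sketch like this, each test utterance would be passed through `mix_at_snr` once per entry in `SNR_LEVELS_DB`, and the fused vectors from `fuse_features` would feed the audiovisual recognizer in place of the audio-only features.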
Original language | English |
---|---|
Title of host publication | FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing |
Pages | 127-131 |
Number of pages | 5 |
Publication status | Published - Sep 2015 |
Event | FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing - Austria, Vienna, Austria. Duration: 11 Sep 2015 → 13 Sep 2015. http://www.isca-speech.org/archive/avsp15/av15_127.html |
Conference
Conference | FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing |
---|---|
Abbreviated title | FAAVSP 2015 |
Country/Territory | Austria |
City | Vienna |
Period | 11/09/15 → 13/09/15 |
Profiles
- Richard Harvey
  - School of Computing Sciences - Professor
  - Smart Emerging Technologies - Member

  Person: Research Group Member, Academic, Teaching & Research