The aim of this paper is to use visual speech information to create Wiener filters for audio speech enhancement. Wiener filters require estimates of both clean speech statistics and noisy speech statistics. Noisy speech statistics are obtained from the noisy input audio, while obtaining clean speech statistics is more difficult and is a major problem in the creation of Wiener filters for speech enhancement. In this work the clean speech statistics are estimated from frames of visual speech that are extracted in synchrony with the audio. The estimation procedure begins by modelling the joint density of clean audio and visual speech features using a Gaussian mixture model (GMM). Using the GMM and an input visual speech vector, a maximum a posteriori (MAP) estimate of the audio feature is made. The effectiveness of speech enhancement using the visually-derived Wiener filter has been compared to a conventional audio-based Wiener filter implementation using a perceptual evaluation of speech quality (PESQ) analysis. PESQ scores in train noise at different signal-to-noise ratios (SNRs) show that the visually-derived Wiener filter significantly outperforms the audio-based Wiener filter at lower SNRs.
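The estimation pipeline described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it assumes hypothetical feature dimensions and a toy joint GMM, approximates the MAP estimate by taking the conditional mean of the component with the highest posterior given the visual vector, and derives a per-channel Wiener gain from the estimated clean power and an assumed noisy power.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only): D_a audio features,
# D_v visual features, K mixture components.
D_a, D_v, K = 4, 3, 2

# Toy joint GMM over z = [audio; visual], standing in for a model
# fitted to synchronised clean audio-visual training data.
weights = np.array([0.6, 0.4])
means = rng.normal(size=(K, D_a + D_v))
covs = np.stack([np.eye(D_a + D_v) + 0.1 * np.outer(r, r)
                 for r in rng.normal(size=(K, D_a + D_v))])

def estimate_audio_from_visual(v, weights, means, covs):
    """MAP-style estimate of the clean audio feature given a visual
    vector: pick the component with the highest posterior given v,
    then return that component's conditional mean E[a | v, k]."""
    log_post = np.empty(K)
    cond_means = np.empty((K, D_a))
    for k in range(K):
        mu_a, mu_v = means[k, :D_a], means[k, D_a:]
        S = covs[k]
        S_av = S[:D_a, D_a:]          # audio-visual cross-covariance
        S_vv = S[D_a:, D_a:]          # visual marginal covariance
        diff = v - mu_v
        sol = np.linalg.solve(S_vv, diff)
        _, logdet = np.linalg.slogdet(S_vv)
        # log w_k + log N(v; mu_v, S_vv), up to a shared constant
        log_post[k] = np.log(weights[k]) - 0.5 * (diff @ sol + logdet)
        cond_means[k] = mu_a + S_av @ sol
    return cond_means[np.argmax(log_post)]

v_obs = rng.normal(size=D_v)
a_hat = estimate_audio_from_visual(v_obs, weights, means, covs)

# Wiener gain per channel: clean power / (clean power + noise power).
# The noise power here is an arbitrary stand-in for statistics that
# would come from the noisy input audio.
clean_psd = np.maximum(a_hat, 1e-3) ** 2
noisy_psd = clean_psd + 0.5
gain = clean_psd / noisy_psd
```

In practice the gain would be applied to the noisy speech spectrum frame by frame; the key point the sketch shows is that the clean-speech term in the Wiener gain comes entirely from the visual observation via the joint GMM.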
|Publication status||Published - 2007|
|Event||Auditory-Visual Speech Processing 2007 (AVSP2007) - Kasteel Groenendaal, Hilvarenbeek, Netherlands|
Duration: 31 Aug 2007 → 3 Sep 2007
|Conference||Auditory-Visual Speech Processing 2007 (AVSP2007)|
|Period||31/08/07 → 3/09/07|