This work investigates a selection of audio and visual speech features with the aim of finding pairs that maximise audio-visual correlation. Two audio speech features are analysed: filterbank vectors and the first four formant frequencies. Three visual features are considered: active appearance model (AAM), 2-D DCT and cross-DCT. Audio and visual speech features were extracted from a database of 200 sentences, and multiple linear regression was used to measure the audio-visual correlation. The results show that filterbank features exhibit a multiple correlation of around R=0.8 with visual features, while formant frequencies show substantially less correlation with visual features: R=0.6 for formants 1 and 2 and less than R=0.4 for formants 3 and 4. The three visual features show almost identical correlation with the audio features, varying in multiple correlation by less than 0.1, even though the methods of visual feature extraction are very different. Measuring the audio-visual correlation within each phoneme and then averaging across all phonemes increased the correlation to R=0.9.
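As an illustration of the measurement described in the abstract, the following is a minimal sketch of how a multiple correlation R could be computed with multiple linear regression, and how a within-class (for example, within-phoneme) average could be formed. It is not the authors' implementation. The direction of the regression (one audio coefficient predicted from a visual feature vector), the function names `multiple_correlation` and `mean_within_class_correlation`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def multiple_correlation(X, y):
    """Multiple correlation R between predictors X (n, p) and target y (n,).

    R is the Pearson correlation between y and its least-squares
    prediction from X (with an intercept term).
    """
    X1 = np.column_stack([np.ones(len(X)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)   # ordinary least squares
    y_hat = X1 @ coef
    return float(np.corrcoef(y, y_hat)[0, 1])


def mean_within_class_correlation(X, y, labels):
    """Fit a separate regression per class label and average the R values."""
    rs = [
        multiple_correlation(X[labels == c], y[labels == c])
        for c in np.unique(labels)
    ]
    return float(np.mean(rs))


# Synthetic stand-in data (hypothetical, not from the paper's database).
rng = np.random.default_rng(0)
n = 600
visual = rng.normal(size=(n, 3))            # stand-in visual feature vectors
labels = rng.integers(0, 4, size=n)         # stand-in phoneme labels
offsets = np.array([-3.0, -1.0, 1.0, 3.0])  # class-dependent audio offset
audio = (
    visual @ np.array([0.8, -0.5, 0.3])
    + offsets[labels]
    + rng.normal(scale=0.5, size=n)
)

r_global = multiple_correlation(visual, audio)
r_within = mean_within_class_correlation(visual, audio, labels)
print(f"global R = {r_global:.2f}, mean within-class R = {r_within:.2f}")
```

In this synthetic setup the audio value depends on the class label as well as on the visual features, so a single global regression explains less of the variance than separate per-class regressions. That is one plausible mechanism for a within-phoneme increase of the kind the abstract reports, though the abstract does not state the cause.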
|Publication status||Published - 2007|
|Event||Auditory-Visual Speech Processing (AVSP2007) - Hilvarenbeek, Netherlands|
Duration: 31 Aug 2007 → 3 Sep 2007