The Challenge of Multispeaker Lip-Reading

Stephen Cox, Richard Harvey, Yuxuan Lan, Jacob Newman, Barry-John Theobald

Research output: Contribution to conference › Paper › peer-review

67 Citations (Scopus)


In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio) the problem has been less well studied. Results are often presented that are based on speaker-dependent (single speaker) or multispeaker (speakers in the test-set are also in the training-set) data, situations that are of limited use in real applications. This paper shows the danger of not using different speakers in the training- and test-sets. Firstly, we present classification results on a new single-word database, AVletters 2, which is a high-definition version of the well known AVletters database. By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to be very close to that of audio-only recognition for the single-speaker and multispeaker configurations. However, in the speaker-independent configuration, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio features and the visual features, we demonstrate that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant.
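The abstract does not give the details of the MDS analysis, but the kind of comparison it describes can be sketched as follows: embed per-speaker, per-class mean feature vectors into a low-dimensional space with MDS and compare within-speaker spread against between-speaker spread. The snippet below is a minimal illustration using scikit-learn's MDS on synthetic stand-in vectors (the feature values, dimensions, and spread measures are assumptions for illustration, not the paper's actual data or procedure).

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Hypothetical stand-ins for per-(speaker, class) mean feature vectors.
# In the paper these would be MFCCs (audio) or lip-reading visual features
# from AVletters 2; here random vectors are used purely for shape.
n_speakers, n_classes, n_dims = 5, 26, 40
features = rng.normal(size=(n_speakers, n_classes, n_dims))

# One row per (speaker, class) pair, embedded into 2-D with metric MDS.
X = features.reshape(n_speakers * n_classes, n_dims)
embedding = MDS(n_components=2, random_state=0).fit_transform(X)
speakers = np.repeat(np.arange(n_speakers), n_classes)

# Crude spread measures: mean distance of a speaker's points to that
# speaker's centroid (within-speaker) versus mean distance between
# speaker centroids (between-speaker).
centroids = np.stack([embedding[speakers == s].mean(axis=0)
                      for s in range(n_speakers)])
within = np.mean([np.linalg.norm(embedding[speakers == s] - centroids[s],
                                 axis=1).mean()
                  for s in range(n_speakers)])
between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                   for i in range(n_speakers)
                   for j in range(i + 1, n_speakers)])

print(f"within-speaker spread:  {within:.3f}")
print(f"between-speaker spread: {between:.3f}")
```

For features like those the paper describes, a small within-speaker spread combined with a large between-speaker spread would indicate that the features encode speaker identity more strongly than the spoken class.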
Original language: English
Number of pages: 6
Publication status: Published - Sep 2008
Event: International Conference on Auditory-Visual Speech Processing - Queensland, Australia
Duration: 26 Sep 2008 – 29 Sep 2008


Conference: International Conference on Auditory-Visual Speech Processing
