Visual units and confusion modelling for automatic lip-reading

Dominic Howell, Stephen Cox, Barry Theobald

Research output: Contribution to journal › Article


Abstract

Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy.
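
As an illustration of the confusion-modelling idea in the abstract, the following minimal sketch estimates substitution and deletion probabilities from aligned pairs of spoken and recognised units, then converts them to negative-log weights of the kind a weighted finite-state transducer could carry on its arcs. It is not the authors' implementation: the function name `estimate_confusions`, the `<del>` marker, the add-one smoothing, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

DELETION = "<del>"  # hypothetical marker for a unit that was deleted


def estimate_confusions(aligned_pairs, smoothing=1.0):
    """Estimate P(recognised | spoken) from aligned (spoken, recognised) pairs.

    Additive smoothing (an assumption of this sketch) keeps unseen
    confusions from receiving zero probability.
    """
    counts = defaultdict(Counter)
    outputs = set()
    for spoken, recognised in aligned_pairs:
        counts[spoken][recognised] += 1
        outputs.add(recognised)

    model = {}
    for spoken, row in counts.items():
        total = sum(row.values()) + smoothing * len(outputs)
        model[spoken] = {o: (row[o] + smoothing) / total for o in outputs}
    return model


# Toy alignment: voicing is not visible, so /b/ is often recognised as /p/.
pairs = [("b", "p"), ("b", "p"), ("b", "b"), ("p", "p"), ("p", DELETION)]
model = estimate_confusions(pairs)

# Negative-log probabilities, usable as arc weights in the tropical semiring.
weights = {
    (spoken, recognised): -math.log(p)
    for spoken, row in model.items()
    for recognised, p in row.items()
}
```

Each row of `model` is a probability distribution over what a spoken unit may be recognised as, including deletion. In a transducer cascade, a confusion transducer built from `weights` would be composed with the lexicon and language model; that step is omitted here because it depends on the WFST toolkit in use.
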
Original language: English
Pages (from-to): 1-12
Number of pages: 12
Journal: Image and Vision Computing
Volume: 51
Early online date: 1 Apr 2016
Publication status: Published - Jul 2016

Keywords

  • Lip-reading
  • Speech recognition
  • Visemes
  • Weighted finite state transducers
  • Confusion matrices
  • Confusion modelling
