A Mouth Full of Words: Visually Consistent Acoustic Redubbing

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, many-to-one, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech and offers insight into automatic speech recognition and the importance of language modelling.
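
As a rough illustration of the idea in the abstract, the sketch below enumerates word sequences that are consistent with a sequence of dynamic visemes. It is not the authors' implementation: the viseme IDs, the `DV_TO_PHONES` inventory, the `LEXICON`, and all function names are hypothetical toy values, the search is brute-force enumeration rather than graph composition, and the language-model step that would rank the candidates is omitted.

```python
from itertools import product

# Hypothetical toy inventory: each dynamic viseme ID maps to several
# phoneme strings that could plausibly produce the same lip movement.
DV_TO_PHONES = {
    "v1": [("m",), ("b",), ("p",)],
    "v2": [("ae", "t"), ("ae", "d"), ("eh", "t")],
}

# Hypothetical pronunciation dictionary: phoneme tuple -> word.
LEXICON = {
    ("m", "ae", "t"): "mat", ("b", "ae", "t"): "bat", ("p", "ae", "t"): "pat",
    ("m", "ae", "d"): "mad", ("b", "ae", "d"): "bad", ("p", "ae", "d"): "pad",
    ("m", "eh", "t"): "met", ("b", "eh", "t"): "bet", ("p", "eh", "t"): "pet",
}


def phoneme_paths(viseme_seq):
    """Yield every phoneme sequence obtained by picking one phoneme
    string per dynamic viseme and concatenating the picks."""
    for choice in product(*(DV_TO_PHONES[v] for v in viseme_seq)):
        yield tuple(p for chunk in choice for p in chunk)


def segment(phones, lexicon):
    """Return all ways to split a phoneme tuple into dictionary words."""
    if not phones:
        return [[]]
    out = []
    for i in range(1, len(phones) + 1):
        word = lexicon.get(phones[:i])
        if word is not None:
            for rest in segment(phones[i:], lexicon):
                out.append([word] + rest)
    return out


def redub_candidates(viseme_seq):
    """Collect the distinct word sequences consistent with the visemes."""
    results = set()
    for phones in phoneme_paths(viseme_seq):
        for words in segment(phones, LEXICON):
            results.add(" ".join(words))
    return sorted(results)


if __name__ == "__main__":
    candidates = redub_candidates(["v1", "v2", "v1", "v2"])
    print(len(candidates), "candidate word sequences, e.g.", candidates[:3])
```

Even in this toy setting the number of candidates grows multiplicatively with the length of the viseme sequence, which is the ambiguity the abstract refers to; a language model would then be needed to pick out the plausible ones.
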
Original language: English
Title of host publication: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: IEEE Press
Pages: 4904-4908
ISBN (Electronic): 978-1-4673-6997-8
Publication status: Published - 6 Aug 2015
Event: International Conference on Acoustics, Speech and Signal Processing - Brisbane, Australia
Duration: 19 Apr 2015 to 24 Apr 2015

Conference

Conference: International Conference on Acoustics, Speech and Signal Processing
Country: Australia
City: Brisbane
Period: 19/04/15 to 24/04/15

Keywords

  • Audio-visual speech
  • dynamic visemes
  • acoustic redubbing
