Audio-to-Visual Speech Conversion using Deep Neural Networks

Sarah Taylor, Akihiro Kato, Ben Milner, Iain Matthews

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

25 Citations (Scopus)
102 Downloads (Pure)


We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal.
We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping
visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations
and perform a thorough analysis of our results.
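The overlap-averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_window` stands in for the trained sliding-window DNN, and the feature dimensions are placeholders. Each window position produces a window of visual-feature frames, and frames covered by several windows are averaged to yield a smooth trajectory.

```python
import numpy as np

def overlap_average(predict_window, acoustic, win_len, vis_dim):
    """Slide a window over acoustic frames, predict a visual-feature
    window at each position, and average overlapping predictions.

    predict_window: stand-in for the trained DNN; maps a
        (win_len, acoustic_dim) window to a (win_len, vis_dim) window.
    acoustic: (n_frames, acoustic_dim) array of acoustic features.
    """
    n_frames = acoustic.shape[0]
    accum = np.zeros((n_frames, vis_dim))   # summed predictions per frame
    counts = np.zeros(n_frames)             # how many windows cover each frame

    for start in range(n_frames - win_len + 1):
        window = acoustic[start:start + win_len]
        vis = predict_window(window)        # (win_len, vis_dim) prediction
        accum[start:start + win_len] += vis
        counts[start:start + win_len] += 1

    # Average the overlapping predictions frame by frame.
    return accum / counts[:, None]
```

With a hop of one frame, interior frames receive `win_len` overlapping predictions, which is what smooths the resulting animation trajectory.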
Original language: English
Title of host publication: Proceedings of the Interspeech Conference 2016
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - Sep 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sep 2016 - 12 Sep 2016


Conference: Interspeech 2016
Country/Territory: United States
City: San Francisco
