Audio-to-Visual Speech Conversion using Deep Neural Networks

Sarah Taylor, Akihiro Kato, Ben Milner, Iain Matthews

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Citations (Scopus)
96 Downloads (Pure)

Abstract

We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
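
The averaging of overlapping window predictions mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the window length, stride, or feature dimensionality, so everything here is an assumption: a stride of one frame, a hypothetical function name `average_overlapping_windows`, and toy one-dimensional features. It is not the authors' implementation.

```python
def average_overlapping_windows(window_predictions, num_frames):
    """Average per-frame predictions from overlapping sliding windows.

    Assumption: window_predictions[t] is the window predicted at start
    frame t, i.e. a list of K feature vectors covering frames t .. t+K-1
    (stride of one frame). Frames beyond num_frames are ignored.
    """
    dim = len(window_predictions[0][0])
    sums = [[0.0] * dim for _ in range(num_frames)]
    counts = [0] * num_frames

    for start, window in enumerate(window_predictions):
        for offset, features in enumerate(window):
            frame = start + offset
            if frame >= num_frames:
                break
            # Accumulate this window's estimate for the frame.
            for d, value in enumerate(features):
                sums[frame][d] += value
            counts[frame] += 1

    # Divide each accumulated vector by the number of windows covering it;
    # frames covered by no window are left as zeros.
    return [
        [s / c for s in row] if c else row
        for row, c in zip(sums, counts)
    ]


# Toy example: two windows of length 2 over 3 frames, 1-D features.
toy_windows = [
    [[0.0], [2.0]],  # covers frames 0 and 1
    [[4.0], [6.0]],  # covers frames 1 and 2
]
smoothed = average_overlapping_windows(toy_windows, num_frames=3)
```

In this toy case frame 1 is covered by both windows, so its value is the mean of the two estimates, while frames 0 and 2 keep their single estimate. Averaging in this way is what yields a smoothly varying trajectory from independently predicted windows.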
Original language: English
Title of host publication: Proceedings of the Interspeech Conference 2016
Publisher: International Speech Communication Association
Pages: 1482-1486
Number of pages: 5
DOIs
Publication status: Published - Sep 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sep 2016 to 12 Sep 2016

Conference

Conference: Interspeech 2016
Country: United States
City: San Francisco
Period: 8/09/16 to 12/09/16
