Abstract
We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal.
We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping
visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations
and perform a thorough analysis of our results.
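As a rough illustration of the overlap-and-average step described in the abstract, the sketch below slides a fixed-length window over the acoustic frames, predicts a window of visual features at each position, and averages the overlapping frame predictions into one smooth trajectory. It is a minimal NumPy sketch, not the authors' implementation: `predict_window`, `win_len`, and `vis_dim` are hypothetical stand-ins for the trained sliding-window DNN and its window/feature sizes, which the abstract does not specify.

```python
import numpy as np

def overlap_average_predictions(acoustic, predict_window, win_len, vis_dim):
    """
    Slide a fixed-length window over the acoustic feature sequence, predict a
    window of visual features at every frame, and average the overlapping
    predictions into a single smoothly varying visual trajectory.

    acoustic       : (T, A) array of acoustic feature frames
    predict_window : callable mapping a flattened acoustic window to a
                     flattened visual window of shape (win_len * vis_dim,)
                     -- a stand-in for the trained sliding-window DNN
    win_len        : number of frames in each window (assumed here)
    vis_dim        : dimensionality of one visual feature frame (assumed here)
    """
    T = acoustic.shape[0]
    half = win_len // 2
    # Pad the edges so every output frame is covered by full-length windows.
    padded = np.pad(acoustic, ((half, half), (0, 0)), mode="edge")

    sums = np.zeros((T, vis_dim))
    counts = np.zeros((T, 1))
    for t in range(T):
        # Acoustic context window centred (approximately) on frame t.
        window = padded[t:t + win_len].ravel()
        visual = predict_window(window).reshape(win_len, vis_dim)
        # Map the predicted window back onto the original frame indices.
        lo = max(0, t - half)
        hi = min(T, t - half + win_len)
        sums[lo:hi] += visual[lo - (t - half):hi - (t - half)]
        counts[lo:hi] += 1
    # Average the overlapping predictions frame by frame.
    return sums / counts

# Hypothetical usage with a trained model wrapped as a function:
# visual_traj = overlap_average_predictions(mfcc_frames, dnn_predict,
#                                           win_len=11, vis_dim=30)
```

The averaging acts as a simple temporal smoother: each output frame is the mean of all window predictions that cover it, which is what yields the continuous animation described above.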
| Field | Value |
| --- | --- |
| Original language | English |
| Title of host publication | Proceedings of the Interspeech Conference 2016 |
| Publisher | International Speech Communication Association |
| Pages | 1482-1486 |
| Number of pages | 5 |
| DOIs | |
| Publication status | Published - Sept 2016 |
| Event | Interspeech 2016, San Francisco, United States (8 Sept 2016 → 12 Sept 2016) |
Conference
| Field | Value |
| --- | --- |
| Conference | Interspeech 2016 |
| Country/Territory | United States |
| City | San Francisco |
| Period | 8/09/16 → 12/09/16 |
Profiles

- Ben Milner
  - School of Computing Sciences - Senior Lecturer
  - Data Science and AI - Member
  - Interactive Graphics and Audio - Member
  - Smart Emerging Technologies - Member
  - Person: Research Group Member, Academic, Teaching & Research