Abstract
We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal.
We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping
visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations
and perform a thorough analysis of our results.
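The abstract describes averaging overlapping per-window visual predictions to obtain a continuous trajectory. A minimal sketch of that overlap-averaging step is below; it is an illustration only, not the authors' implementation, and the function name, window/hop parameters, and array layout are assumptions for the example.

```python
import numpy as np

def overlap_average(windows, hop):
    """Average overlapping windows of per-frame predictions.

    windows: array of shape (n_windows, win_len, dim), one predicted
             visual-feature window per sliding position (hypothetical layout).
    hop:     frame shift between consecutive windows.
    Returns a (n_frames, dim) trajectory where each frame is the mean of
    every window prediction that covers it.
    """
    n_windows, win_len, dim = windows.shape
    n_frames = hop * (n_windows - 1) + win_len
    acc = np.zeros((n_frames, dim))      # sum of predictions per frame
    counts = np.zeros((n_frames, 1))     # how many windows cover each frame
    for i, w in enumerate(windows):
        start = i * hop
        acc[start:start + win_len] += w
        counts[start:start + win_len] += 1
    return acc / counts                  # per-frame mean -> smooth trajectory
```

Because every output frame is a mean over several window predictions, discontinuities at window boundaries are smoothed out, which is what yields the continuous animation the abstract refers to.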
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Interspeech Conference 2016 |
| Publisher | International Speech Communication Association |
| Pages | 1482-1486 |
| Number of pages | 5 |
| DOIs | |
| Publication status | Published - Sept 2016 |
| Event | Interspeech 2016, San Francisco, United States. Duration: 8 Sept 2016 → 12 Sept 2016 |
Conference
| Conference | Interspeech 2016 |
|---|---|
| Country/Territory | United States |
| City | San Francisco |
| Period | 8/09/16 → 12/09/16 |
Profiles
- Ben Milner
- School of Computing Sciences - Senior Lecturer
- Data Science and AI - Member
Person: Research Group Member, Academic, Teaching and Research