Visual speech synthesis using dynamic visemes, contextual features and DNNs

Ausdang Thangthai, Ben Milner, Sarah Taylor

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

This paper examines methods to improve visual speech synthesis from a text input using a deep neural network (DNN). Two representations of the input text are considered: phoneme sequences and dynamic viseme sequences. From these sequences, contextual features are extracted that include information at varying linguistic levels, from the frame level up to the utterance level. These are extracted from a broad sliding window that captures context and produces features that are input into the DNN to estimate visual features. Experiments first compare the accuracy of these visual features against an HMM baseline method, which establishes that both the phoneme and dynamic viseme systems perform better, with the best performance obtained by a combined phoneme-dynamic viseme system. An investigation into the features then reveals the importance of the frame-level information, which avoids discontinuities in the visual feature sequence and produces a smooth and realistic output.
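As a rough illustration of the kind of pipeline the abstract describes, the sketch below stacks per-frame contextual linguistic features over a sliding window and regresses visual features with a feed-forward DNN. All dimensions, the window size, the network shape and the use of PyTorch are assumptions for illustration only; they are not the configuration reported in the paper.

import numpy as np
import torch
import torch.nn as nn

def stack_context(frames: np.ndarray, window: int = 5) -> np.ndarray:
    """Stack +/- `window` neighbouring frames of per-frame linguistic
    features so each input row carries local context (sliding window)."""
    T, _ = frames.shape
    padded = np.pad(frames, ((window, window), (0, 0)), mode="edge")
    return np.stack(
        [padded[t:t + 2 * window + 1].reshape(-1) for t in range(T)]
    ).astype(np.float32)

# Toy data: 200 frames of 60-d contextual linguistic features mapped to
# 30-d visual features (e.g. a PCA mouth-shape parameterisation).
lin = np.random.randn(200, 60).astype(np.float32)
vis = np.random.randn(200, 30).astype(np.float32)

X = torch.from_numpy(stack_context(lin, window=5))   # shape (200, 660)
Y = torch.from_numpy(vis)

# Simple feed-forward regressor from windowed linguistic features
# to visual features.
model = nn.Sequential(
    nn.Linear(X.shape[1], 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, Y.shape[1]),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)   # minimise mean squared error on visual features
    loss.backward()
    opt.step()

In practice the input would be contextual features derived from phoneme or dynamic viseme sequences rather than random data, and the predicted visual feature trajectories would drive the synthesised face.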
Original language: English
Title of host publication: Proceedings of the Interspeech Conference 2016
Publisher: International Speech Communication Association
Pages: 2458-2462
Number of pages: 5
DOIs
Publication status: Published - Sept 2016
Event: Interspeech 2016 - San Francisco, United States
Duration: 8 Sept 2016 - 12 Sept 2016

Conference

Conference: Interspeech 2016
Country/Territory: United States
City: San Francisco
Period: 8/09/16 - 12/09/16
