Abstract
The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker dependent and speaker independent scenarios.
A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using this system are further improved to 88% and 98% respectively.
The speaker independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
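The image-based 2D-DCT features described in the abstract are a standard visual speech representation: the 2D-DCT of a grayscale mouth-region image is taken and only the lowest-frequency coefficients are kept as the feature vector. A minimal sketch of this idea follows; the 32×32 region size, the 44-coefficient truncation, and the zig-zag ordering are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.fft import dctn

def dct2_features(mouth_roi, n_coeffs=44):
    """Compute image-based 2D-DCT visual speech features.

    Takes the orthonormal 2D-DCT of a grayscale mouth-region image
    and keeps the n_coeffs lowest-frequency coefficients, ordered by
    a zig-zag scan (increasing i + j) from the top-left corner.
    """
    coeffs = dctn(mouth_roi.astype(np.float64), norm="ortho")
    h, w = coeffs.shape
    # Zig-zag scan: visit coefficients in order of increasing spatial
    # frequency, so truncation keeps the coarse shape information.
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda ij: (ij[0] + ij[1], ij[0]))
    return np.array([coeffs[i, j] for i, j in order[:n_coeffs]])

# Example: a synthetic 32x32 "mouth region" frame.
frame = np.random.default_rng(0).uniform(0, 255, (32, 32))
feat = dct2_features(frame)
print(feat.shape)  # -> (44,)
```

In a frame-level classifier such as the GMM or single-layer network baselines, one such vector would be computed per video frame and used as the observation for the non-speech/unvoiced/voiced decision.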
Original language | English |
---|---|
Publication status | Published - 2015 |
Event | FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing, Vienna, Austria. Duration: 11 Sept 2015 → 13 Sept 2015. http://www.isca-speech.org/archive/avsp15/av15_127.html |
Conference
Conference | FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing |
---|---|
Abbreviated title | FAAVSP 2015 |
Country/Territory | Austria |
City | Vienna |
Period | 11/09/15 → 13/09/15 |
Internet address | http://www.isca-speech.org/archive/avsp15/av15_127.html |
Profiles
- Ben Milner
  - School of Computing Sciences - Senior Lecturer
  - Data Science and AI - Member
  - Visual Computing and Signal Processing - Member
  - Person: Research Group Member, Academic, Teaching & Research