Abstract
This work examines whether visual speech information can be effective within audio masking-based speaker separation to improve the quality and intelligibility of the target speech. First, two visual-only methods of generating an audio mask for speaker separation are developed. These use a deep neural network to map visual speech features to an audio feature space, from which both visually-derived binary masks and visually-derived ratio masks are estimated before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation, which is then extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, with the highest performance occurring when audio and visual information are combined to create the audio-visual masks.
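The abstract does not give implementation detail, but the mask-and-apply step it describes is conventional time-frequency masking. The sketch below is a minimal illustration, not the authors' implementation: it applies a ratio mask to the short-time Fourier transform of a two-speaker mixture and resynthesises the target. The function name, the use of SciPy's STFT, the frame settings, and the oracle target estimate (standing in for the magnitude spectrum the paper's DNN would predict from visual or audio-visual features) are all assumptions for illustration only.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_ratio_mask(mixture, target_estimate, fs=16000, nperseg=512):
    """Separate a target speaker from a mixture using a ratio mask.

    `target_estimate` is a placeholder for the target speech estimate that,
    in the paper, would come from a DNN driven by visual or audio-visual
    features; here it is supplied directly as a signal for illustration.
    """
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)          # mixture STFT
    _, _, S = stft(target_estimate, fs=fs, nperseg=nperseg)  # estimated target STFT

    # Ratio mask: estimated target magnitude over mixture magnitude in each
    # time-frequency cell, clipped to [0, 1]. A binary mask would instead
    # threshold this ratio, e.g. (mask > 0.5).
    mask = np.clip(np.abs(S) / (np.abs(X) + 1e-8), 0.0, 1.0)

    # Apply the mask to the mixture and resynthesise the enhanced target.
    _, enhanced = istft(mask * X, fs=fs, nperseg=nperseg)
    return enhanced

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    # Toy "speakers": two tones standing in for target and interfering speech.
    target = np.sin(2 * np.pi * 220 * t)
    interferer = 0.5 * np.sin(2 * np.pi * 1760 * t)
    enhanced = apply_ratio_mask(target + interferer, target, fs=fs)
    print(enhanced.shape)
```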
| Field | Value |
| --- | --- |
| Original language | English |
| Pages (from-to) | 1742-1754 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Audio, Speech, and Language Processing |
| Volume | 26 |
| Issue number | 10 |
| Early online date | 18 May 2018 |
| DOIs | |
| Publication status | Published - Oct 2018 |
Profiles
- Ben Milner
  - School of Computing Sciences, Senior Lecturer
  - Interactive Graphics and Audio, Member
  - Smart Emerging Technologies, Member
  - Person: Research Group Member, Academic, Teaching & Research