Abstract
This work examines whether visual speech information can be effective within audio masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map visual speech features to an audio feature space from which both visually-derived binary masks and visually-derived ratio masks are estimated, before application to the speech mixture. Secondly, an audio ratio masking method forms a baseline approach for speaker separation, which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but the highest performance occurs when combining audio and visual information to create the audio-visual masks.
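The masking operations described in the abstract can be illustrated with a minimal sketch. The code below shows how a ratio mask and a binary mask are formed from target and interferer spectrogram magnitudes and applied to a mixture; the variable names, the magnitude-domain mixture approximation, and the target-dominance rule for the binary mask are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative magnitude spectrograms (257 frequency bins x 100 frames);
# random data stands in for STFT magnitudes of real speech.
rng = np.random.default_rng(0)
target = rng.random((257, 100))      # |STFT| of target speaker
interferer = rng.random((257, 100))  # |STFT| of interfering speaker
mixture = target + interferer        # magnitude-domain mixture (approximation)

# Ratio mask: per time-frequency fraction of energy belonging to the target.
ratio_mask = target / (target + interferer + 1e-8)

# Binary mask: 1 where the target dominates the interferer, else 0.
binary_mask = (target > interferer).astype(float)

# Apply each mask to the mixture to estimate the target speech.
est_ratio = ratio_mask * mixture
est_binary = binary_mask * mixture
```

In the visual-only methods of the paper, the mask is not computed from the true target and interferer (which are unavailable at test time) but estimated from visual speech features mapped into the audio feature space by a deep neural network.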
| Original language | English |
|---|---|
| Pages (from-to) | 1742-1754 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Audio, Speech, and Language Processing |
| Volume | 26 |
| Issue number | 10 |
| Early online date | 18 May 2018 |
| DOIs | |
| Publication status | Published - Oct 2018 |
Profiles

Ben Milner
- School of Computing Sciences - Senior Lecturer
- Data Science and AI - Member
- Person: Research Group Member, Academic, Teaching and Research