Using visual speech information in masking methods for audio speaker separation

Faheem Khan, Ben P. Milner, Thomas Le Cornu

Research output: Contribution to journal, Article, peer-review


This work examines whether visual speech information can be effective within audio masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map visual speech features to an audio feature space from which both visually-derived binary masks and visually-derived ratio masks are estimated, before application to the speech mixture. Secondly, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
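To make the masking terminology in the abstract concrete, the following is a minimal Python sketch of a generic binary mask and ratio mask applied to a two-speaker mixture. It is not the authors' method: the abstract describes masks estimated by a deep neural network from visual (and audio) features, whereas this sketch computes oracle masks from known target and interferer magnitudes. The function names (`mix_at_snr`, `binary_mask`, `ratio_mask`, `apply_mask`) and the power-ratio form of the ratio mask are assumptions chosen for illustration.

```python
import numpy as np


def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the target-to-interferer power ratio is snr_db.

    Hypothetical helper: returns the mixture and the scaled interferer.
    """
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2)
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    scaled = gain * interferer
    return target + scaled, scaled


def binary_mask(target_mag, interferer_mag):
    """1 where the target dominates a time-frequency cell, 0 elsewhere."""
    return (target_mag > interferer_mag).astype(float)


def ratio_mask(target_mag, interferer_mag, eps=1e-12):
    """Soft mask in [0, 1]: target power over total power (assumed form)."""
    t_pow = target_mag ** 2
    i_pow = interferer_mag ** 2
    return t_pow / (t_pow + i_pow + eps)


def apply_mask(mask, mixture_mag):
    """Element-wise masking of the mixture magnitude spectrogram."""
    return mask * mixture_mag


if __name__ == "__main__":
    # Toy magnitude "spectrograms" (frequency x time), not real speech.
    target_mag = np.array([[3.0, 1.0], [0.0, 2.0]])
    interferer_mag = np.array([[1.0, 3.0], [2.0, 2.0]])
    mixture_mag = target_mag + interferer_mag  # crude stand-in for |STFT(mix)|

    print(apply_mask(binary_mask(target_mag, interferer_mag), mixture_mag))
    print(apply_mask(ratio_mask(target_mag, interferer_mag), mixture_mag))
```

In a real system the magnitudes of the separate speakers are unknown at test time, which is why the masks must be estimated, here by a network driven by visual or audio-visual features. The `mix_at_snr` helper shows one common way to construct mixtures at levels such as -10 dB to +10 dB for evaluation.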
Original language: English
Pages (from-to): 1742-1754
Number of pages: 13
Journal: IEEE Transactions on Audio, Speech, and Language Processing
Issue number: 10
Early online date: 18 May 2018
Publication status: Published - Oct 2018
