This work examines whether visual speech information can be effective within audio masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map visual speech features to an audio feature space from which both visually-derived binary masks and visually-derived ratio masks are estimated, before application to the speech mixture. Secondly, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
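As a rough illustration of the masking idea in the abstract, the sketch below builds a binary mask and a ratio mask from per-bin energy estimates and applies them to a mixture. It is not the authors' implementation: the function names, the toy energy values, and the additive-energy mixture are all assumptions made for illustration. The energy estimates are simply given here, whereas in the work they would be derived from visual or audio features.

```python
import math


def binary_mask(target, interferer):
    """Return 1.0 in each time-frequency bin where the estimated target
    energy exceeds the estimated interferer energy, otherwise 0.0."""
    return [[1.0 if t > i else 0.0 for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target, interferer)]


def ratio_mask(target, interferer, eps=1e-12):
    """Return the fraction of each bin's energy attributed to the target.
    eps (an assumed small constant) avoids division by zero."""
    return [[t / (t + i + eps) for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target, interferer)]


def apply_mask(mask, mixture):
    """Scale each time-frequency bin of the mixture by its mask value."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]


def interferer_gain(target_power, interferer_power, snr_db):
    """Amplitude gain for the interferer so that the target-to-interferer
    power ratio equals snr_db (for example -10 to +10 dB)."""
    return math.sqrt(target_power / (interferer_power * 10 ** (snr_db / 10)))


# Hypothetical 2x2 energy estimates (rows: time frames, columns: frequency bins).
target_est = [[4.0, 1.0], [0.5, 9.0]]
interferer_est = [[1.0, 3.0], [0.5, 1.0]]

# Assumption: mixture energy is approximated as the sum of the two sources.
mixture = [[t + i for t, i in zip(t_row, i_row)]
           for t_row, i_row in zip(target_est, interferer_est)]

separated_binary = apply_mask(binary_mask(target_est, interferer_est), mixture)
separated_ratio = apply_mask(ratio_mask(target_est, interferer_est), mixture)
```

The binary mask keeps or discards each bin outright, while the ratio mask scales each bin by the target's estimated share of the energy, which is the distinction between the two visually-derived mask types named in the abstract.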
|Number of pages||13|
|Journal||IEEE Transactions on Audio, Speech, and Language Processing|
|Early online date||18 May 2018|
|Publication status||Published - Oct 2018|