This paper proposes an integrated speech front-end for both speech recognition and speech reconstruction applications. Speech is first decomposed into a set of frequency bands by an auditory model. The output of the auditory model is then used to extract both robust pitch estimates and MFCC vectors. Initial tests used a 128-channel auditory model, but results show that the number of channels can be reduced significantly, to between 23 and 32. A detailed analysis of the pitch classification accuracy and the RMS pitch error shows the system to be more robust than both comb-function and LPC-based pitch extraction. Speech recognition results show that the auditory-based cepstral coefficients give very similar performance to conventional MFCCs. Spectrograms and informal listening tests also reveal that speech reconstructed from the auditory-based cepstral coefficients and pitch is of similar quality to speech reconstructed from conventional MFCCs and pitch.
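As a rough illustration of how cepstral coefficients can be derived from per-channel outputs of a filterbank, the sketch below log-compresses one frame of channel energies and applies a DCT-II, which is the standard final step of conventional MFCC extraction. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `cepstra_from_channels`, the energy floor, and the choice of 13 coefficients are hypothetical, and the abstract does not specify the auditory model or how its outputs are converted to energies.

```python
import math

def cepstra_from_channels(channel_energies, num_coeffs=13):
    """Log-compress filterbank channel energies and apply an unnormalised DCT-II.

    channel_energies: one frame of non-negative per-channel energies
    (assumed here to come from some auditory or mel filterbank).
    num_coeffs: number of cepstral coefficients to keep (assumed value).
    """
    n = len(channel_energies)
    # Floor the energies so that silent channels do not produce log(0).
    log_e = [math.log(max(e, 1e-10)) for e in channel_energies]
    coeffs = []
    for k in range(num_coeffs):
        # DCT-II basis: cos(pi * k * (j + 0.5) / n) over channels j = 0..n-1.
        c = sum(log_e[j] * math.cos(math.pi * k * (j + 0.5) / n)
                for j in range(n))
        coeffs.append(c)
    return coeffs

# Hypothetical 23-channel frame, matching the lower end of the range
# quoted in the abstract.
frame = [1.0 + 0.1 * j for j in range(23)]
cepstra = cepstra_from_channels(frame)
print(len(cepstra))
```

Because the DCT is applied across channels, the same function works unchanged for 23, 32, or 128 channels; only the length of the input frame differs.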
|Number of pages||4|
|Publication status||Published - Sep 2003|
|Event||Eurospeech-2003 — 8th European Conference on Speech Communication and Technology - Geneva, Switzerland|
Duration: 1 Sep 2003 → 4 Sep 2003