Prediction of fundamental frequency and voicing from mel-frequency cepstral coefficients for unconstrained speech reconstruction

Ben Milner, Xu Shao

Research output: Contribution to journal › Article › peer-review

51 Citations (Scopus)


This work proposes a method for predicting the fundamental frequency and voicing of a frame of speech from its mel-frequency cepstral coefficient (MFCC) vector representation. This information is subsequently used to enable a speech signal to be reconstructed solely from a stream of MFCC vectors, and has particular application in distributed speech recognition systems. Prediction is achieved by modeling the joint density of fundamental frequency and MFCCs. This joint density is first modeled using a Gaussian mixture model (GMM) and then extended by using a set of hidden Markov models to link together a series of state-dependent GMMs. Prediction accuracy is measured on unconstrained speech input for both a speaker-dependent system and a speaker-independent system. A fundamental frequency prediction error of 3.06% is obtained on the speaker-dependent system, in comparison to 8.27% on the speaker-independent system. On the speaker-dependent system, 5.22% of frames have voicing errors, compared to 8.82% on the speaker-independent system. Spectrogram analysis of reconstructed speech shows that highly intelligible speech is produced, with the quality of the speaker-dependent speech being slightly higher owing to the more accurate fundamental frequency and voicing predictions.
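The core idea of the GMM stage described in the abstract — fitting a joint density over [MFCC, f0] vectors and then predicting f0 from an MFCC vector via the conditional (MMSE) expectation — can be sketched as follows. This is not the authors' implementation; it uses synthetic stand-in data, an arbitrary component count, and scikit-learn/SciPy purely for illustration.

```python
# Hedged sketch: MMSE prediction of f0 from an MFCC vector under a joint
# GMM. Data, dimensions, and mixture size are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic stand-in data: X = 13-dim "MFCC" frames, y = scalar "f0".
n, d = 2000, 13
X = rng.normal(size=(n, d))
y = (0.5 * X[:, 0] + rng.normal(scale=0.1, size=n))[:, None]  # toy coupling

# Fit a GMM on the joint [MFCC, f0] vectors.
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(np.hstack([X, y]))

def predict_f0(x, gmm, d):
    """MMSE estimate E[f0 | mfcc] under the joint GMM: a posterior-weighted
    sum of per-component conditional means."""
    post = np.empty(gmm.n_components)
    cond_means = np.empty(gmm.n_components)
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:d], mu[d:]
        Sxx, Sxy = S[:d, :d], S[:d, d:]
        # Responsibility of component k given only the MFCC part of the frame.
        post[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_x, Sxx)
        # Conditional mean of f0 given x within component k.
        cond_means[k] = (mu_y + Sxy.T @ np.linalg.solve(Sxx, x - mu_x)).item()
    post /= post.sum()
    return float(post @ cond_means)

print(predict_f0(X[0], gmm, d))
```

A voicing decision could be handled the same way, e.g. by thresholding a posterior-weighted voicing probability; the paper's HMM extension replaces the single GMM with state-dependent GMMs selected along a decoded state sequence.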
Original language: English
Pages (from-to): 24-33
Number of pages: 10
Journal: IEEE Transactions on Audio, Speech, and Language Processing
Issue number: 1
Publication status: Published - 2007