There are many circumstances in which it is useful or necessary to recognise phones rather than words, but phone recognition is inherently less accurate than word recognition. We describe here a post-recognition method for "translating" an errorful phone string output by a speech recogniser into a string that more closely matches the reference transcription. The technique owes something to Kohonen's idea of "dynamically expanding context" in that it learns from the errors made by the recogniser in a particular context, but it uses many contexts rather than a single context to estimate the "translation" of a recognised phone. The weights given to the different contexts in estimating the translation are determined discriminatively. On the WSJCAM0 database, the technique gives a 19.2% relative reduction in phone errors (including insertions) over the baseline, compared with a 6.2% reduction obtained using dynamically expanding context.
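The idea of combining evidence from several contexts can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`contexts`, `train`, `translate`), the symmetric context windows, the one-to-one alignment between recognised and reference phones (so insertions and deletions are ignored), and the hand-set per-width weights are all assumptions made for illustration. In the paper the weights are determined discriminatively, which this sketch does not attempt.

```python
from collections import Counter, defaultdict


def contexts(phones, i, max_width):
    """Yield one context key per window width (0..max_width) around position i."""
    for w in range(max_width + 1):
        left = tuple(phones[max(0, i - w):i])
        right = tuple(phones[i + 1:i + 1 + w])
        yield (w, left, phones[i], right)


def train(pairs, max_width=1):
    """Count, for every context, which reference phone the recognised phone stood for.

    `pairs` holds (recognised, reference) sequences assumed to be aligned
    one-to-one; real recogniser output would first need an alignment step.
    """
    table = defaultdict(Counter)
    for rec, ref in pairs:
        for i, target in enumerate(ref):
            for key in contexts(rec, i, max_width):
                table[key][target] += 1
    return table


def translate(rec, table, weights, max_width=1):
    """Replace each recognised phone by the candidate with the highest weighted vote."""
    out = []
    for i in range(len(rec)):
        scores = Counter()
        for key in contexts(rec, i, max_width):
            counts = table.get(key)
            if not counts:
                continue  # context never seen in training
            total = sum(counts.values())
            for phone, c in counts.items():
                # weights[key[0]] is the (hand-set, hypothetical) weight for this width
                scores[phone] += weights[key[0]] * c / total
        # Fall back to the recognised phone when no context matches
        out.append(max(scores, key=scores.get) if scores else rec[i])
    return out


# Toy data: the recogniser outputs "d" for a reference "t" after "s".
pairs = [
    (["s", "d", "a"], ["s", "t", "a"]),
    (["s", "d", "o"], ["s", "t", "o"]),
    (["a", "d", "a"], ["a", "d", "a"]),
]
table = train(pairs, max_width=1)
weights = {0: 0.3, 1: 0.7}  # illustrative values only

corrected = translate(["s", "d", "a"], table, weights, max_width=1)
unchanged = translate(["a", "d", "a"], table, weights, max_width=1)
```

In this toy example the wider context outvotes the context-free statistics, so "d" is translated to "t" only when it follows "s". Using a single context, as in dynamically expanding context, would correspond to trusting just one of these keys rather than summing over all of them.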
|Number of pages||4|
|Publication status||Published - 2004|
|Event||8th International Conference on Spoken Language Processing (Interspeech 2004) - Jeju Island, Korea|
Duration: 4 Oct 2004 → 8 Oct 2004