Abstract
We describe some high-level approaches to estimating confidence scores for the words output by a speech recognizer. By "high-level" we mean that the proposed measures do not rely on decoder-specific "side information" and so should find more general applicability than measures that have been developed for specific recognizers. Our main approach is to attempt to decouple the language modeling and acoustic modeling in the recognizer in order to generate independent information from these two sources that can then be used for estimation of confidence. We isolate these two information sources by using a phone recognizer working in parallel with the word recognizer. A set of techniques for estimating confidence measures using the phone recognizer output in conjunction with the word recognizer output is described. The most effective of these techniques is based on the construction of "metamodels," which generate alternative word hypotheses for an utterance. An alternative approach requires no other recognizers or extra information for confidence estimation and is based on the notion that a word that is semantically "distant" from the other decoded words in the utterance is likely to be incorrect. We describe a method for constructing "semantic similarities" between words and hence estimating a confidence score. Results using the UK version of the Wall Street Journal corpus are given for each technique.
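As a rough illustration of the phone-level comparison idea (a simplified sketch only, not the metamodel technique developed in the paper), the Python fragment below scores each decoded word by how well the phones produced by an independent phone recognizer within the word's time span agree with the word's dictionary pronunciation. The lexicon, time spans, and phone segments are hypothetical placeholders.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two phone sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]


def phone_agreement_confidence(word, word_span, phone_segments, lexicon):
    """
    Confidence in [0, 1]: one minus the normalized edit distance between
    the word's dictionary pronunciation and the phones the phone
    recognizer placed inside the word's time span (start, end) in seconds.
    """
    expected = lexicon[word]
    start, end = word_span
    observed = [p for (p, s, e) in phone_segments if s < end and e > start]
    if not expected and not observed:
        return 1.0
    dist = edit_distance(expected, observed)
    return 1.0 - dist / max(len(expected), len(observed))


# Toy example with a hypothetical lexicon and phone-recognizer output
lexicon = {"speech": ["s", "p", "iy", "ch"], "wreck": ["r", "eh", "k"]}
# (phone, start_time, end_time) triples from the parallel phone recognizer
phone_segments = [("s", 0.00, 0.10), ("p", 0.10, 0.18), ("iy", 0.18, 0.30),
                  ("ch", 0.30, 0.40), ("r", 0.40, 0.48), ("ih", 0.48, 0.55),
                  ("k", 0.55, 0.62)]

for word, span in [("speech", (0.0, 0.40)), ("wreck", (0.40, 0.62))]:
    conf = phone_agreement_confidence(word, span, phone_segments, lexicon)
    print(f"{word}: confidence = {conf:.2f}")
```

Here "wreck" receives a lower score than "speech" because one of its expected phones disagrees with what the phone recognizer produced; a confusion-weighted alignment or the paper's metamodels would refine this crude agreement measure.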
| Original language | English |
|---|---|
| Pages (from-to) | 460-471 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Speech and Audio Processing |
| Volume | 10 |
| Issue number | 7 |
| DOIs | |
| Publication status | Published - 2002 |