Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers

Prashant Serai, Peidong Wang, Eric Fosler-Lussier

TL;DR

The paper tackles predicting ASR errors from plain-text transcripts under modern neural acoustic models and off-the-shelf systems. It introduces a sampling-based ConfMat predictor and a context-aware Seq2Seq predictor that generate errorful phoneme sequences, which are converted to WFSTs for decoding. Across Switchboard/Fisher and a cloud-based Virtual Patient dataset, the sampling approach substantially improves error-chunk recall and complete-utterance recall, while Seq2Seq adds context modeling but generalizes less well. The findings indicate that a peaky, sample-based confusion model better captures neural error behavior, enabling realistic simulated ASR data for robust NLP tasks and downstream training.

Abstract

Modeling the errors of a speech recognizer can help simulate errorful recognized speech data from plain text, which has proven useful for tasks such as discriminative language modeling and improving the robustness of NLP systems, where limited or even no audio data is available at training time. Previous work typically considered replicating the behavior of GMM-HMM based systems, but the behavior of more modern posterior-based neural network acoustic models is not the same and requires adjustments to the error prediction model. In this work, we extend a prior phonetic-confusion-based model for predicting speech recognition errors in two ways. First, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model. Second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then by using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.
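
To make the sampling idea concrete, here is a minimal Python sketch of drawing errorful phone sequences from a phonetic confusion matrix and checking whether an observed sequence falls within 100 sampled guesses. It is an illustration under stated assumptions, not the authors' implementation: the `CONFUSION` table, its phones and probabilities, and the helper names `sample_phone`, `sample_sequences`, and `in_guesses` are all hypothetical. The sketch only models substitutions. It leaves out insertions, deletions, and the WFST decoding with a lexicon and language model that the paper describes.

```python
import random

# Hypothetical toy confusion matrix: P(recognized phone | true phone).
# Phones and probabilities are invented for illustration only.
CONFUSION = {
    "k":  {"k": 0.90, "g": 0.08, "t": 0.02},
    "ae": {"ae": 0.85, "eh": 0.10, "ah": 0.05},
    "t":  {"t": 0.88, "d": 0.09, "k": 0.03},
}


def sample_phone(true_phone, rng):
    """Draw one recognized phone from the confusion row of true_phone."""
    row = CONFUSION[true_phone]
    phones = list(row)
    weights = [row[p] for p in phones]
    return rng.choices(phones, weights=weights, k=1)[0]


def sample_sequences(true_phones, n_samples, seed=0):
    """Sample n_samples errorful phone sequences, one phone at a time."""
    rng = random.Random(seed)
    return [
        tuple(sample_phone(p, rng) for p in true_phones)
        for _ in range(n_samples)
    ]


def in_guesses(observed, guesses):
    """Return True if the observed sequence is among the sampled guesses."""
    return tuple(observed) in set(guesses)


if __name__ == "__main__":
    guesses = sample_sequences(["k", "ae", "t"], n_samples=100)
    # Does a plausible misrecognition appear among the 100 guesses?
    print(in_guesses(["k", "eh", "t"], guesses))
```

Because each guess commits to one concrete phone per position, most samples reproduce the true phone and only a few carry a confusion. This is one way to read the "peaky" behavior that the TL;DR attributes to posterior-based acoustic models.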

Paper Structure

This paper contains 10 sections, 3 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: General pipeline for generating simulated errorful sequences: text transcripts are converted to phone sequences, which are converted to confusable phone lattices through the error prediction model. These are then decoded by a WFST model incorporating a lexicon and language model.