UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction

Jiaxin Guo, Minghan Wang, Xiaosong Qiao, Daimeng Wei, Hengchao Shang, Zongyao Li, Zhengzhe Yu, Yinglu Li, Chang Su, Min Zhang, Shimin Tao, Hao Yang

TL;DR

UCorrect addresses ASR error correction with an unsupervised Detector-Generator-Selector framework that locates erroneous characters, proposes candidate replacements via phonetic-filtered generation, and selects corrections using token-level probabilities. Built on pre-trained language models, it eliminates reliance on pseudo or original paired data and improves interpretability by isolating detection, generation, and selection steps. Empirical results on AISHELL-1 and WenetSpeech with Conformer and U2++ show substantial WER reductions, especially after fine-tuning, and competitive latency compared to end-to-end baselines, with notable reductions in false alarm rate. The approach demonstrates universality across decoding strategies and datasets, indicating practical impact for robust, scalable ASR error correction without heavy data dependence.

Abstract

Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and depend strongly on Pseudo Paired Data and Original Paired Data. However, when pre-trained only on Pseudo Paired Data, previous models have a negative effect on correction. Fine-tuning on Original Paired Data, in turn, requires the source-side data to be transcribed by a well-trained ASR model, which is time-consuming and not universal. In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR error correction. UCorrect has no dependency on the training data mentioned above. The procedure is to first detect whether a character is erroneous, then generate candidate characters, and finally select the most confident candidate to replace the erroneous character. Experiments on the public AISHELL-1 and WenetSpeech datasets show the effectiveness of UCorrect for ASR error correction: 1) it achieves significant WER reduction, 6.83% even without fine-tuning and 14.29% after fine-tuning; 2) it outperforms popular non-autoregressive (NAR) correction models by a large margin with competitively low latency; and 3) it is a universal method, as it reduces the WER of the ASR model under every decoding strategy tested and of ASR models trained on datasets of different scales.
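
The detect, generate, select procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `correct`, the detection `threshold`, the `toy_score` bigram table (standing in for a pre-trained language model's token-level probabilities), and the `CONFUSABLE` homophone table (standing in for phonetic filtering) are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: probability of character `c` at position `i`
# given the rest of the sentence. A real system would query a pre-trained LM.
Scorer = Callable[[List[str], int, str], float]


def correct(chars: List[str], score: Scorer,
            confusable: Dict[str, List[str]],
            threshold: float = 0.1) -> List[str]:
    """Toy Detector-Generator-Selector loop (illustrative assumption only)."""
    out = list(chars)
    for i, ch in enumerate(chars):
        # Detector: a character is flagged when its probability is low.
        if score(out, i, ch) >= threshold:
            continue
        # Generator: propose phonetically similar candidates only.
        candidates = confusable.get(ch, [])
        if not candidates:
            continue
        # Selector: keep the most probable candidate if it beats the original.
        best = max(candidates, key=lambda c: score(out, i, c))
        if score(out, i, best) > score(out, i, ch):
            out[i] = best
    return out


# Toy stand-ins (assumed values, not taken from the paper).
TOY_BIGRAM = {
    ("<s>", "今"): 0.9, ("今", "天"): 0.9,
    ("天", "天"): 0.8, ("天", "气"): 0.9,
}
CONFUSABLE = {"添": ["天", "田"]}  # characters that sound alike


def toy_score(chars: List[str], i: int, c: str) -> float:
    # Score `c` by how plausible it is after the character to its left.
    left = chars[i - 1] if i > 0 else "<s>"
    return TOY_BIGRAM.get((left, c), 0.01)


if __name__ == "__main__":
    hypothesis = list("今天添气")  # ASR output containing a homophone error
    print("".join(correct(hypothesis, toy_score, CONFUSABLE)))
```

Keeping the three steps separate is what makes each decision inspectable: one can log which positions were flagged, which candidates were proposed, and which score drove the final replacement.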

Paper Structure (15 sections, 3 equations, 1 figure, 4 tables)