Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

TL;DR

This work addresses the fact that traditional language models are unaware of ASR errors by introducing Denoising LM (DLM), a scaled error-correction model trained on synthetic data generated with Text-to-Speech (TTS) and ASR systems. The approach combines multi-speaker TTS, diverse noise augmentations, and a new decoding scheme (DSR-decoding) to learn a robust denoising distribution $p_{\text{DLM}}(y \mid \hat{y})$, which supports greedy decoding as well as rescoring of an $n$-best list of corrections, without using external audio data. Empirically, DLM achieves state-of-the-art WER on LibriSpeech ($1.5\%$ on test-clean, $3.3\%$ on test-other), generalizes across ASR architectures and domains, and scales with model size, data, and speaker diversity. The results suggest that error-correction models can surpass traditional neural LMs in ASR and offer practical, efficient improvements for real-world transcription systems, with careful design of the training-data distribution and diverse noise as key factors for success.
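
Since the results above are reported as word error rate, a minimal sketch of the metric may help: WER is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. The function name `wer` and the whitespace tokenization are assumptions made for illustration; the text normalization used for the reported numbers is not described in this section.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("their dog barked", "there dog barked")` counts one substitution over three reference words.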

Abstract

Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), a $\textit{scaled}$ error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several $\textit{key ingredients}$: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $\textit{test-clean}$ and 3.3% WER on $\textit{test-other}$ on LibriSpeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and which even match self-supervised methods that use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
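
The data-generation loop described in the abstract (text, then TTS audio, then ASR hypothesis, paired with the original text) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `make_dlm_pairs`, `toy_tts`, and `toy_asr` are hypothetical names, and the two toy functions are trivial stand-ins for real TTS and ASR systems so that the sketch runs on its own.

```python
import random
from typing import Callable, List, Sequence, Tuple

Audio = List[float]  # placeholder type for a waveform


def make_dlm_pairs(
    texts: Sequence[str],
    tts: Callable[[str, str], Audio],  # hypothetical: (text, speaker) -> audio
    asr: Callable[[Audio], str],       # hypothetical: audio -> hypothesis
    speakers: Sequence[str],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Build (noisy hypothesis, clean text) training pairs for a DLM."""
    rng = random.Random(seed)
    pairs = []
    for clean_text in texts:
        speaker = rng.choice(speakers)      # multi-speaker TTS
        audio = tts(clean_text, speaker)    # synthesize audio from text
        noisy_hypothesis = asr(audio)       # ASR output, possibly with errors
        pairs.append((noisy_hypothesis, clean_text))
    return pairs


def toy_tts(text: str, speaker: str) -> Audio:
    # Stand-in only: encodes characters as numbers and ignores the speaker.
    return [float(ord(ch)) for ch in text]


def toy_asr(audio: Audio) -> str:
    # Stand-in only: decodes the toy audio and injects one homophone error.
    text = "".join(chr(int(sample)) for sample in audio)
    return " ".join("there" if word == "their" else word for word in text.split())
```

Calling `make_dlm_pairs(["their dog barked"], toy_tts, toy_asr, ["spk0", "spk1"])` yields one pair whose input is the noisy hypothesis and whose target is the original text, which is the supervision format a correction model of this kind is trained on.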

Paper Structure (38 sections, 6 equations, 1 figure, 13 tables)

This paper contains 38 sections, 6 equations, 1 figure, 13 tables.

Figures (1)

  • Figure 1: Conventional LM decoding and the proposed DSR-decoding. Left panel: given an input audio, beam-search decoding of the ASR model with an $N$-gram word-level LM generates an $n$-best list of hypotheses, which is then rescored using the neural LM scores and ASR scores. Right panel: given an input audio, the ASR generates its greedy hypothesis, which is fed into the DLM; the DLM generates an $n$-best list of corrected hypotheses, which are then rescored using both DLM scores and ASR scores.
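
The right panel of Figure 1 can be sketched as a small rescoring step. This is a hedged illustration only: the function name `dsr_rescore`, the weight `lam`, and the simple weighted sum of log-scores are assumptions, since the exact scoring formula is not given in this section. The DLM candidates and ASR scores in the usage note are made-up toy values.

```python
from typing import Callable, List, Tuple


def dsr_rescore(
    dlm_candidates: List[Tuple[str, float]],  # (corrected text, DLM log-score)
    asr_score: Callable[[str], float],        # hypothetical: text -> ASR log-score
    lam: float = 1.0,                         # assumed interpolation weight
) -> str:
    """Pick the DLM candidate with the best combined DLM + ASR score."""
    if not dlm_candidates:
        raise ValueError("need at least one candidate")

    def combined(candidate: Tuple[str, float]) -> float:
        text, dlm_log_score = candidate
        # Assumed combination: DLM score plus weighted ASR score.
        return dlm_log_score + lam * asr_score(text)

    best_text, _ = max(dlm_candidates, key=combined)
    return best_text
```

With toy candidates `[("there dog barked", -0.5), ("their dog barked", -0.7)]` and an ASR scorer that assigns -3.0 and -1.0 respectively, `lam=1.0` selects "their dog barked", while `lam=0.0` falls back to the DLM's own top candidate.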