Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li

TL;DR

DARAG (Data- and Retrieval-Augmented Generative Error Correction) is a novel approach designed to improve generative error correction (GEC) for ASR in in-domain (ID) and out-of-domain (OOD) scenarios. It introduces retrieval-augmented correction, which augments the input with named entities retrieved from a database.

Abstract

Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon amplifies with named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction by augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8% to 30% relative WER improvements in ID and 10% to 33% improvements in OOD settings.

Paper Structure

This paper contains 28 sections, 1 equation, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Comparison of traditional GEC and DARAG. We augment the training dataset with synthetic data generated using our algorithm and named entities retrieved from a datastore to improve in-domain and out-of-domain ASR.
  • Figure 2: Illustration of DARAG. ① We generate synthetic data with LLMs and TTS models, which is then used to generate hypotheses with diverse errors consistent with the types the ASR model produces on the test set. ② We extract the NEs and store them in a datastore. During training, for every instance, we retrieve the top-k NEs most similar to the best hypothesis and use them to construct an instruction-response pair. Note that in OOD settings we assume the availability of only a few unsupervised speech samples in the original train set; pseudo-transcripts for prompting are generated using the in-domain ASR model.
  • Figure 3: Comparison of DARAG with other methods on low-resource source-free UDA (LS $\rightarrow$ Vox). DARAG outperforms other methods with significant improvements.
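
The retrieval step described in the Figure 2 caption (retrieve the top-k NEs most similar to the best hypothesis, then build an instruction-response pair) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the character-trigram overlap score is a simple stand-in and not the retriever used in the paper, and the function names and the prompt wording are hypothetical.

```python
from typing import Dict, List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def similarity(entity: str, hypothesis: str) -> float:
    """Stand-in score: share of the entity's n-grams found in the hypothesis."""
    ent = char_ngrams(entity)
    hyp = char_ngrams(hypothesis)
    return len(ent & hyp) / len(ent)


def retrieve_top_k(hypothesis: str, datastore: List[str], k: int = 2) -> List[str]:
    """Rank every stored named entity against the hypothesis and keep the top k."""
    ranked = sorted(datastore, key=lambda ne: similarity(ne, hypothesis), reverse=True)
    return ranked[:k]


def build_pair(hypotheses: List[str], reference: str,
               datastore: List[str], k: int = 2) -> Dict[str, str]:
    """Build one instruction-response pair (hypothetical prompt template).

    hypotheses[0] is treated as the best hypothesis and drives retrieval.
    """
    entities = retrieve_top_k(hypotheses[0], datastore, k)
    instruction = (
        "Correct the ASR transcript using the hypotheses and candidate entities.\n"
        "Hypotheses:\n" + "\n".join(f"- {h}" for h in hypotheses) + "\n"
        "Candidate named entities: " + ", ".join(entities)
    )
    return {"instruction": instruction, "response": reference}
```

For example, calling `build_pair(["the libri speech corpus was recorded by volunteers"], "the LibriSpeech corpus was recorded by volunteers", ["Dinesh Manocha", "LibriSpeech", "Vox Populi", "Jinyu Li"], k=1)` places `LibriSpeech` in the instruction, since it shares the most trigrams with the misrecognized span. A learned embedding model could replace `similarity` without changing the rest of the sketch.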