Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

Chao-Han Huck Yang; Yile Gu; Yi-Chieh Liu; Shalini Ghosh; Ivan Bulyko; Andreas Stolcke

TL;DR

The paper studies large language models (LLMs) as post-processors for automatic speech recognition (ASR) output, performing rescoring and error correction without full fine-tuning. It introduces two pipelines: Pipeline 1 combines LLM-driven error correction with a standard rescoring model, while Pipeline 2 uses Task-Activating Prompting (TAP) to prompt frozen LLMs to perform rescoring directly, leveraging zero-/few-shot in-context learning. The study compares multiple prompting strategies, including zero-shot domain hints and chain-of-thought reasoning, as well as a Hypotheses-to-Transcription loss that enables limited fine-tuning with adapters, showing that LLMs can outperform domain-tuned LMs on ATIS and WSJ when properly prompted. Findings indicate that larger models (e.g., InstructGPT) with TAP and reasoning prompts yield substantial WER reductions, with additional gains from few-shot demonstrations and parameter-efficient adapters. Even without fine-tuning, the LLMs demonstrate strong generalization, though in that setting their accuracy still falls short of the N-best oracle. The work highlights the practical potential of cloud-based LLM post-processing for ASR and opens paths to further improvements by integrating acoustic representations into LLMs.

Abstract

We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines causal instructions and demonstration to increase its context windows. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs.
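
The N-best oracle level mentioned in the abstract is the word error rate (WER) obtained by always picking, for each utterance, the hypothesis in the N-best list that is closest to the reference. A minimal sketch of that computation follows; the function names and the toy utterances are illustrative assumptions, not taken from the paper or its data.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def oracle_hypothesis(ref, nbest):
    """Hypothesis in the N-best list with the fewest word errors."""
    return min(nbest, key=lambda h: edit_distance(ref.split(), h.split()))


# Toy data (made up for illustration only).
refs = ["show me flights from boston to denver", "what is the cheapest fare"]
nbest_lists = [
    ["show me flights from austin to denver",
     "show me flight from boston to denver",
     "show me flights from boston to denver"],
    ["what is the cheapest fair",
     "what is a cheapest fair"],
]

first_pass = [nbest[0] for nbest in nbest_lists]
oracle = [oracle_hypothesis(r, n) for r, n in zip(refs, nbest_lists)]

first_pass_wer = corpus_wer(refs, first_pass)
oracle_wer = corpus_wer(refs, oracle)
```

In this toy data the second list contains no error-free hypothesis, so a rescorer that only selects from the list cannot go below the oracle WER, whereas a generative corrector that outputs "what is the cheapest fare" could.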

Paper Structure (16 sections, 3 equations, 5 figures, 4 tables)

Introduction
Related Work
Method
In-Context Learning Background and Techniques
Zero-shot domain-hint prompting
Zero-shot reasoning
Few-shot and one-shot in-context learning
N-best hypotheses to transcription fine-tuning
Task-activating Prompting (TAP) Framework
Experiments and Results
Pretrained ASR and Rescoring Model Training
Pretrained LLM Configurations
Target-Domain Datasets
Pipeline 1 ($\mathcal{P}_1$) Results
Pipeline 2 ($\mathcal{P}_2$) Results
...and 1 more sections

Figures (5)

Figure 1: Two ASR post-processing frameworks using LLMs: (a) correct errors (e.g., grammar [guo2019spelling]) before applying a standard rescoring model, or (b) perform zero/few-shot rescoring; with optional task-activating prompting (see the Task-activating Prompting (TAP) Framework section).
Figure 2: Four LLM in-context learning uses for ASR 2nd pass
Figure 3: Queries (Q) and responses (R) for N-best evaluation and correction by task-activating prompting (TAP) of LLMs
Figure 4: $\mathcal{P}_1$ ASR rescoring (RS) training using hypotheses corrected by LLM. The dashed red line marks the $N$-best WER. The WER gradually decreases in the three stages of rescoring using our $\mathcal{P}_1$ processing: Stage 0, $N$-best hypothesis with LLM correction ($N$C); Stage 1, fine-tuned RescoreBERT [xu2022rescorebert] using the masked language modeling (MLM) loss; and Stage 2, MWER training.
Figure 5: WER results on ATIS and WSJ with few-shot learning based on InstructGPT, for increasing numbers of demonstration samples. "One-by-one prompting" resets the model history after each utterance, "in-context prompting" lets the history (and thus the examples provided) accumulate.
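
The two prompting modes named in the Figure 5 caption differ only in whether earlier turns stay in the model's context. The sketch below illustrates that difference under stated assumptions: the `PromptSession` class, the `Q:`/`R:` turn format, and the demonstration text are hypothetical, and no real LLM API is called.

```python
class PromptSession:
    """Builds prompts with or without accumulated history (hypothetical helper)."""

    def __init__(self, accumulate):
        self.accumulate = accumulate
        self.history = []  # earlier (query, response) turns kept in context

    def prompt_for(self, query):
        """Prompt text that would be sent to the model for this query."""
        turns = [f"Q: {q}\nR: {r}" for q, r in self.history]
        return "\n\n".join(turns + [f"Q: {query}\nR:"])

    def record(self, query, response):
        """Store a finished turn, or reset the history in one-by-one mode."""
        if self.accumulate:
            self.history.append((query, response))
        else:
            self.history = []


# Made-up demonstrations: an N-best list as the query, a transcript as the response.
demos = [
    ("1. flights to boston 2. flight to boston", "flights to boston"),
    ("1. cheapest fair 2. cheapest fare", "cheapest fare"),
]

in_context = PromptSession(accumulate=True)   # "in-context prompting"
one_by_one = PromptSession(accumulate=False)  # "one-by-one prompting"
for query, response in demos:
    in_context.record(query, response)
    one_by_one.record(query, response)

test_query = "1. show me the fair 2. show me the fare"
in_context_prompt = in_context.prompt_for(test_query)
one_by_one_prompt = one_by_one.prompt_for(test_query)
```

With accumulation, the prompt for the test utterance carries both demonstrations ahead of the new query; with a reset after each utterance, it contains the new query alone.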