Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting
Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke
TL;DR
The paper addresses improving ASR output by using large language models as post-processors for rescoring and error correction without full fine-tuning. It introduces two pipelines: Pipeline 1 combines LLM-driven error correction with a standard rescoring model, while Pipeline 2 uses Task-Activating Prompting (TAP) to prompt frozen LLMs to perform rescoring directly, leveraging zero- and few-shot in-context learning. The study compares multiple prompting strategies, including zero-shot domain hints and chain-of-thought reasoning, and uses a Hypotheses-to-Transcription loss to enable limited fine-tuning with adapters, showing that LLMs can outperform domain-tuned LMs on ATIS and WSJ when properly prompted. Findings indicate that larger models (e.g., InstructGPT) with TAP and reasoning prompts yield substantial WER reductions, with additional gains from few-shot demonstrations and parameter-efficient adapters. Even without fine-tuning, the LLMs demonstrate strong generalization, though frozen-LLM results alone do not reach the N-best oracle level; combining prompting with fine-tuning brings error rates below it. The work highlights the practical potential of cloud-based LLM post-processing for ASR and opens paths to further improvements by integrating acoustic representations into LLMs.
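To make the idea of prompting a frozen LLM with an N-best list more concrete, here is a minimal sketch of how a zero- or few-shot rescoring prompt might be assembled. The function name `build_rescoring_prompt`, the prompt wording, and the field labels are all hypothetical and chosen for illustration; they are not the prompts used in the paper, and the call to an actual LLM is deliberately left out.

```python
def build_rescoring_prompt(nbest, demonstrations=(), domain_hint=None):
    """Assemble a text prompt from an N-best list (hypothetical format).

    nbest:          list of hypothesis strings for the utterance to correct
    demonstrations: iterable of (hypotheses, reference) pairs used as
                    few-shot examples; empty means zero-shot
    domain_hint:    optional short description of the target domain
    """
    lines = []
    if domain_hint:
        # Zero-shot domain hint, stated before the task instruction.
        lines.append(f"Domain: {domain_hint}")
    lines.append(
        "Task: given ASR hypotheses for one utterance, "
        "output the most likely correct transcription."
    )
    # Few-shot demonstrations: each shows hypotheses and the true transcript.
    for demo_hyps, demo_ref in demonstrations:
        lines.append("Hypotheses:")
        lines.extend(f"{i}. {h}" for i, h in enumerate(demo_hyps, 1))
        lines.append(f"Transcription: {demo_ref}")
    # The query utterance, left open for the model to complete.
    lines.append("Hypotheses:")
    lines.extend(f"{i}. {h}" for i, h in enumerate(nbest, 1))
    lines.append("Transcription:")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = (["show me flight to boston", "show me flights to boston"],
            "show me flights to boston")
    prompt = build_rescoring_prompt(
        ["what is the fair to denver", "what is the fare to denver"],
        demonstrations=[demo],
        domain_hint="airline travel information",
    )
    print(prompt)
```

Passing an empty `demonstrations` argument gives the zero-shot variant; adding pairs turns the same template into a few-shot prompt.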
Abstract
We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these tasks without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines causal instructions and demonstrations to increase its context windows. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning, we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs.
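The N-best oracle mentioned above is the word error rate (WER) obtained by always picking the hypothesis in each N-best list that is closest to the reference. It is a lower bound for any method that only selects among the hypotheses, which is why a generative corrector that rewrites text can, in principle, go below it. The sketch below illustrates this with standard word-level edit distance; the function names and the toy sentences are illustrative assumptions, not data or code from the paper.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split())
                 for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def oracle_wer(refs, nbest_lists):
    """WER when the best hypothesis in each N-best list is always selected."""
    errors = sum(min(word_edit_distance(r.split(), h.split()) for h in nbest)
                 for r, nbest in zip(refs, nbest_lists))
    return errors / sum(len(r.split()) for r in refs)


if __name__ == "__main__":
    # Toy data (made up): the second list contains no fully correct hypothesis.
    refs = ["show me flights to boston", "what is the fare"]
    nbest_lists = [
        ["show me flight to boston", "show me flights to boston"],
        ["what is the fair", "what is a fair"],
    ]
    top1 = [nbest[0] for nbest in nbest_lists]
    corrected = ["show me flights to boston", "what is the fare"]
    print("first-pass WER:", corpus_wer(refs, top1))
    print("N-best oracle WER:", oracle_wer(refs, nbest_lists))
    print("corrected WER:", corpus_wer(refs, corrected))
```

In the toy data, no hypothesis in the second list matches its reference, so selection alone cannot reach zero errors, whereas a corrector that generates "fare" can.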
