Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

Yuchen Hu; Chen Chen; Chengwei Qin; Qiushi Zhu; Eng Siong Chng; Ruizhe Li

TL;DR

This paper introduces ClozeGER, a multimodal generative error correction (GER) framework for ASR. It uses SpeechGPT to give the correction model access to the source speech, and it reformulates correction as a cloze test with logits calibration to reduce input redundancy. By combining a speech-grounded hypotheses-to-transcription (H2T) model with a debiased cloze format and a post-processing pass, ClozeGER achieves substantial WER reductions across 9 HyPoradise datasets, outperforming vanilla GER in many cases and approaching or surpassing the N-best oracle on several domains. The key contributions are the multimodal correction paradigm, a logits-calibration technique that mitigates option bias, and a practical post-processing stage, all validated through ablations and analyses. The work is relevant to robust ASR post-processing, especially in multilingual and noisy settings where N-best diversity is limited.

Abstract

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER is highly effective at enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content; 2) N-best hypotheses usually vary in only a few tokens, so sending all of them for GER is redundant, which can confuse the LLM about which tokens to focus on and thus lead to more miscorrections. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) that receives the source speech as extra input to improve the fidelity of the correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
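The cloze reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_cloze`, the `[BLANKk]` placeholder, and the assumption that all hypotheses contain the same number of words are hypothetical simplifications chosen for illustration. Positions where every hypothesis agrees become fixed context, and positions where they differ become blanks with the competing words as options.

```python
from typing import List, Tuple


def build_cloze(nbest: List[str]) -> Tuple[str, List[List[str]]]:
    """Turn equal-length N-best hypotheses into a cloze context and per-blank options.

    Assumption (not from the paper): hypotheses are whitespace-tokenized and
    already have the same number of words, so no alignment step is needed.
    """
    token_lists = [hyp.split() for hyp in nbest]
    length = len(token_lists[0])
    if any(len(tokens) != length for tokens in token_lists):
        raise ValueError("this sketch assumes all hypotheses have the same number of words")

    context: List[str] = []
    blanks: List[List[str]] = []
    for column in zip(*token_lists):
        # Distinct words at this position, in order of first appearance.
        unique = list(dict.fromkeys(column))
        if len(unique) == 1:
            context.append(unique[0])          # all hypotheses agree: keep as context
        else:
            blanks.append(unique)              # disagreement: becomes a blank with options
            context.append(f"[BLANK{len(blanks)}]")
    return " ".join(context), blanks


nbest = ["i think the cat sat", "i think the cap sat", "i think the cat sat"]
context, blanks = build_cloze(nbest)
print(context)
print(blanks)
```

With this input, the model would only need to choose among the options for a single blank instead of reading three nearly identical sentences, which is the redundancy the abstract refers to.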

Paper Structure (19 sections, 7 equations, 5 figures, 8 tables)


Introduction
Related Work
Methodology
Preliminary: Generative Error Correction
GER with Source Speech
ClozeGER
Cloze Format
Logits Calibration with Prior Estimate
Post-processing
Experiments
Setup
Main Results
Ablation Study and Analysis
Conclusion
HyPoradise Dataset Details
...and 4 more sections

Figures (5)

Figure 1: Two limitations of generative error correction (chen2023hp). Left: violating the source speech. The LLM removes the word "Think" from the first two hypotheses because, grammatically, it rarely begins a sentence and is then followed by a subject, yet the word is actually spoken in the source speech. Right: information redundancy in the N-best hypotheses input. The N-best candidates differ at only one position, so sending all of them for GER is redundant and confuses the LLM about which tokens to focus on for correction.
Figure 2: Frameworks of (a) vanilla GER that employs N-best hypotheses to predict ground-truth transcription, (b) GER with source speech as extra input to improve the fidelity of correction output, (c) our ClozeGER that reformats GER as a cloze test with logits calibration, followed by a post-processing stage to further correct the cloze context.
Figure 3: Distribution and cloze accuracy of five options with logits calibration on WSJ dataset. "Dist." denotes the distribution of five options in the predictions, "Acc." denotes the predicting accuracy of each ground-truth option, and "w/ Calib." denotes with logits calibration.
Figure 4: WER (%) results when using different numbers of validation samples for prior estimation. The minimum number of samples required to obtain the best performance is marked with a star.
Figure 5: Distribution and cloze accuracy of five options with and without logits calibration on TED-LIUM 3 dataset. The remarks follow those in Fig. 3.
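The captions above refer to logits calibration with a prior estimated from validation samples. The paper's exact formula is not reproduced on this page, so the sketch below uses a common prior-subtraction scheme as a stand-in: the prior over option labels is taken as the model's average predicted probability for each option on a validation set, and its logarithm is subtracted from the logits at inference time. The function names and the toy numbers are assumptions for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of option logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_prior(validation_logits: List[List[float]]) -> List[float]:
    """Average predicted probability of each option over validation samples.

    Assumption: this averaging is one plausible prior estimate, not
    necessarily the estimator used in the paper.
    """
    probs = [softmax(logits) for logits in validation_logits]
    num_options = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(num_options)]


def calibrate(logits: List[float], prior: List[float]) -> List[float]:
    """Subtract the log prior so that habitually favored options are penalized."""
    return [x - math.log(p) for x, p in zip(logits, prior)]


# Toy example with two options; the model habitually prefers option 0.
validation_logits = [[2.0, 0.0], [2.0, 0.0]]
prior = estimate_prior(validation_logits)

test_logits = [1.0, 0.5]
raw_choice = max(range(2), key=lambda k: test_logits[k])
calibrated = calibrate(test_logits, prior)
calibrated_choice = max(range(2), key=lambda k: calibrated[k])
print(raw_choice, calibrated_choice)
```

In this toy case the raw logits pick option 0, while the calibrated logits pick option 1, because the mild preference for option 0 is smaller than the model's habitual bias toward it.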
