It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Chen Chen; Ruizhe Li; Yuchen Hu; Sabato Marco Siniscalchi; Pin-Yu Chen; Ensiong Chng; Chao-Han Huck Yang

TL;DR

The paper addresses the data uncertainty that arises when large language models (LLMs) perform generative error correction (GER) for automatic speech recognition (ASR) without access to acoustic information. It introduces Uncertainty-Aware Dynamic Fusion (UADF), a two-stage late-fusion framework that first calibrates token-level LLM decisions and then, guided by LLM uncertainty, dynamically incorporates acoustic cues during autoregressive decoding. Empirical results on clean and noisy ASR tasks, as well as audio-visual speech recognition (AVSR), show substantial word error rate (WER) improvements over GER baselines and strong generalization to multimodal settings. The approach is training-efficient and offers a plug-in way to enhance LLM-based GER with acoustic information, with potentially broad impact on multimodal ASR systems.

Abstract

Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output. Specifically, an LLM is used to map the N-best hypotheses list generated by an ASR system directly to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty, since the LLM is trained without taking into account the acoustic information available in the speech signal. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented in the auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization that results from relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.
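
The two stages described in the abstract can be pictured with a minimal Python sketch of a single decoding step. This is an illustration under stated assumptions, not the paper's formulation: the abstract does not give the equations, so temperature scaling as the calibration step, entropy as the uncertainty measure, the sigmoid-based weight, and the names `fuse_step`, `temperature`, and `offset` are all hypothetical choices. The sketch also assumes the LLM and the ASR model score the same token vocabulary.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities from logits; temperature > 1 softens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]

def entropy(log_probs):
    """Shannon entropy in nats, used here as a stand-in for LLM uncertainty."""
    return -sum(math.exp(lp) * lp for lp in log_probs)

def fuse_step(llm_logits, asr_logits, temperature=1.5, offset=0.5):
    """One decoding step of a hypothetical uncertainty-weighted late fusion."""
    # Stage 1 (assumed form): calibrate the LLM with temperature scaling.
    llm_lp = log_softmax(llm_logits, temperature)
    asr_lp = log_softmax(asr_logits)
    # Stage 2 (assumed form): the fusion weight grows with LLM uncertainty.
    uncertainty = entropy(llm_lp)
    weight = 1.0 / (1.0 + math.exp(-uncertainty)) - offset
    fused = [l + weight * a for l, a in zip(llm_lp, asr_lp)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, uncertainty, weight

vocab = ["how", "all", "the"]
# The LLM is nearly undecided, while the ASR model clearly prefers "all".
token, u, w = fuse_step([1.0, 0.9, 0.2], [0.0, 3.0, 0.0])
print(vocab[token], round(u, 3), round(w, 3))
```

With these toy scores the LLM alone would pick "how", but its near-flat distribution raises the fusion weight enough for the acoustic evidence to decide the token. When the LLM is confident, the weight stays close to zero and the LLM decision is left essentially untouched.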

Paper Structure (21 sections, 10 equations, 4 figures, 6 tables)


Introduction
Related Work
A Generative Framework of ASR Error Correction
Acoustic Information Fusion
Fusion Strategy
Calibration
Uncertainty-aware dynamic fusion
Experiment
Dataset
Setup
Result and Analysis
Effect of fusion strategies
Effect of UADF
Generalization of UADF
Conclusion
...and 6 more sections

Figures (4)

Figure 1: Different fusion strategies: early, mid, and late fusion. The green area indicates where each fusion strategy takes place. The N-best list is generated by the ASR engine with beam search decoding. Left: the speech tokens extracted from the acoustic encoder are directly concatenated with the corresponding word embeddings of the N-best list before being fed into the LLM; Middle: the acoustic features from the last layer of the acoustic encoder are integrated into the LLM decoding process using the cross-attention mechanism; Right: step-wise fusion happens in the auto-regressive decoding process by integrating the decision-level information from both modalities.
Figure 2: Case study on a high-uncertainty example ($\mathcal{U}_t^{llm}$ is 9.91). The top-2 candidates from the LLM are displayed, of which "how" is a wrong prediction. UADF corrects the result to "all" according to the decision of the ASR model.
Figure 3: Visualization of the LLM in UADF before (left) and after (right) calibration. The dashed line represents the ideal relationship, where confidence and accuracy are perfectly matched. The blue bars indicate the token distribution across the LLM's confidence intervals, and "$\times$" marks the actual average accuracy within each confidence interval.
Figure 4: Modality laziness reported in chen2023leveraging, where SNR level denotes the quality of speech.
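
The quantities plotted in Figure 3 (token counts per confidence interval and the average accuracy within each interval) can be computed with a short Python sketch. It is a generic reliability-diagram computation, not code from the paper: the function names, the ten equal-width bins, and the expected calibration error (ECE) summary are assumptions added for illustration.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group tokens by confidence and report per-bin statistics.

    Returns (bin_low, bin_high, count, mean_confidence, accuracy) for each
    non-empty bin. A perfectly calibrated model has mean_confidence == accuracy.
    """
    stats = [[0, 0.0, 0] for _ in range(n_bins)]  # count, confidence sum, hits
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        stats[idx][0] += 1
        stats[idx][1] += conf
        stats[idx][2] += 1 if ok else 0
    rows = []
    for i, (count, conf_sum, hits) in enumerate(stats):
        if count:
            rows.append((i / n_bins, (i + 1) / n_bins, count,
                         conf_sum / count, hits / count))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    total = len(confidences)
    return sum(count / total * abs(mean_conf - acc)
               for _, _, count, mean_conf, acc
               in reliability_bins(confidences, correct, n_bins))

# Toy data: per-token top-1 confidence and whether the token was correct.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
for row in reliability_bins(confidences, correct):
    print(row)
print(expected_calibration_error(confidences, correct))
```

In the figure's terms, the `count` column corresponds to the blue bars, the `accuracy` column to the "$\times$" markers, and the dashed diagonal to the case where `mean_confidence` equals `accuracy` in every bin.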
