It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

TL;DR

The paper addresses the data uncertainty that arises when large language models (LLMs) perform generative error correction (GER) for automatic speech recognition (ASR) without access to acoustic information. It introduces Uncertainty-Aware Dynamic Fusion (UADF), a two-stage late-fusion framework that first calibrates token-level LLM decisions and then dynamically incorporates acoustic cues during autoregressive decoding, guided by the LLM's uncertainty. Empirical results on clean and noisy ASR tasks, as well as audio-visual speech recognition (AVSR), show substantial word error rate (WER) improvements over GER baselines and strong generalization to multimodal settings. The approach is training-efficient and works as a plug-in for adding acoustic information to LLM-based GER, with potentially broad impact on multimodal ASR systems.

Abstract

Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of automatic speech recognition (ASR) output. Specifically, an LLM is used to map the N-best hypotheses list generated by an ASR system directly to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty because the LLM is trained without taking into account the acoustic information available in the speech signal. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented within an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization caused by relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.
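
The two stages described in the abstract can be illustrated with a minimal sketch of a single decoding step. This is not the paper's formulation, which is not given in this summary. It assumes temperature scaling as the calibration method, Shannon entropy as the uncertainty measure, and a sigmoid gate that raises the acoustic weight as uncertainty grows. The names `fuse_step`, `temperature`, `threshold`, and `max_weight` are hypothetical, and the sketch assumes the LLM and the ASR model score the same vocabulary.

```python
import numpy as np

EPS = 1e-12  # avoids log(0)


def softmax(logits, temperature=1.0):
    """Softmax with optional temperature scaling."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log(p + EPS)).sum())


def fuse_step(llm_logits, asr_logits, temperature=1.5,
              threshold=1.0, max_weight=0.5):
    """One late-fusion decoding step (illustrative assumptions only).

    Stage 1: calibrate the LLM distribution (assumed: temperature scaling).
    Stage 2: add ASR log-probabilities with a weight that increases with
             the LLM's uncertainty (assumed: sigmoid gate on entropy).
    Returns (chosen token index, LLM uncertainty, acoustic weight).
    """
    p_llm = softmax(llm_logits, temperature)   # stage 1
    p_asr = softmax(asr_logits)
    u = entropy(p_llm)
    w = max_weight / (1.0 + np.exp(-(u - threshold)))  # stage 2 gate
    fused = np.log(p_llm + EPS) + w * np.log(p_asr + EPS)
    return int(np.argmax(fused)), u, float(w)


# A confident LLM keeps its own choice; an uncertain one defers to the ASR model.
confident = fuse_step([10.0, 0.0, 0.0], [0.0, 2.0, 0.0])
uncertain = fuse_step([1.0, 0.9, 0.0], [0.0, 3.0, 0.0])
```

In this toy setup the first call keeps the LLM's top token (index 0), while the second switches to the token favored by the ASR model (index 1) because the higher LLM entropy raises the acoustic weight.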

Paper Structure (21 sections, 10 equations, 4 figures, 6 tables)

This paper contains 21 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Different fusion strategies: early, mid, and late fusion. The green area indicates where each fusion strategy takes place. The N-best list is generated by the ASR engine with beam search decoding. Left: the speech tokens extracted from the acoustic encoder are directly concatenated with the corresponding word embeddings of the N-best list before being fed into the LLM; Middle: the acoustic features from the last layer of the acoustic encoder are integrated into the LLM's decoding process using the cross-attention mechanism; Right: step-wise fusion happens in the auto-regressive decoding process by integrating decision-level information from both models.
  • Figure 2: Case study on a high-uncertainty example ($\mathcal{U}_t^{llm}$ is 9.91). The top-2 candidates from the LLM are displayed; its top choice, "how", is a wrong prediction. UADF corrects the result to "all" according to the decision of the ASR model.
  • Figure 3: Visualization of LLM calibration in UADF before (left) and after (right) calibration. The dashed line represents the ideal relationship, where confidence and accuracy match perfectly. The blue bars indicate the token distribution across the LLM's confidence intervals, and "$\times$" indicates the actual average accuracy within each confidence interval.
  • Figure 4: Modality laziness reported in chen2023leveraging, where SNR level denotes the quality of speech.
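
The kind of plot described for Figure 3 (token counts per confidence interval, with the average accuracy in each interval compared against the diagonal) can be computed as in the sketch below. This is a generic reliability-diagram computation, not the authors' code. The function names and the choice of 10 equal-width bins are assumptions, and expected calibration error (ECE) is included only as a common one-number summary; this summary does not say whether the paper reports it.

```python
import numpy as np


def reliability_bins(confidences, correct, n_bins=10):
    """Group tokens by confidence and report per-bin statistics.

    Returns a list of (bin_low, bin_high, count, mean_confidence, accuracy)
    for every non-empty bin.
    """
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so confidence 1.0 lands in the last bin.
    idx = np.digitize(conf, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        rows.append((float(edges[b]), float(edges[b + 1]), int(mask.sum()),
                     float(conf[mask].mean()), float(hit[mask].mean())))
    return rows


def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    n = len(confidences)
    return sum(count / n * abs(acc - mean_conf)
               for _, _, count, mean_conf, acc
               in reliability_bins(confidences, correct, n_bins))


# Toy data: one well-calibrated set of tokens and one overconfident set.
calibrated_ece = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
overconfident_ece = expected_calibration_error([0.95] * 4, [1, 0, 0, 0])
```

A well-calibrated model keeps each bin's accuracy close to its mean confidence, so its points sit near the dashed diagonal and its ECE is close to zero; an overconfident model falls below the diagonal.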