Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

TL;DR

The paper tackles ASR transcription errors by introducing Whispering LLaMA, a cross-modal generative error-correction framework that fuses acoustic representations from Whisper with linguistic context from LLaMA via residual adapters. It replaces traditional two-pass rescoring with a token-level fusion mechanism that enables LLaMA to generate corrected transcripts conditioned on both audio and n-best hypotheses. The authors demonstrate a substantial WERR improvement (up to 37.66%) across ATIS and GigaSpeech subsets, perform thorough ablations to validate the design, and release open-source code and models for reproducibility. While offering a parameter-efficient fusion of large pre-trained models, they acknowledge high compute costs and data demands as limitations and propose future integration into broader ASR ecosystems.

Abstract

We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the existing ranking-based rescoring methods, our approach uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Across diverse ASR datasets, we evaluate the stability and reproducibility of our fusion technique and demonstrate a word error rate relative (WERR) improvement of 37.66% over the n-best hypotheses. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
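The abstract reports its headline result as WERR without spelling out the arithmetic. The sketch below shows one conventional way to compute it, assuming WERR is the relative reduction in word error rate from a baseline hypothesis to the corrected transcript. The function names `word_error_rate` and `word_error_rate_relative` and the example sentences are hypothetical illustrations, not taken from the paper or its repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def word_error_rate_relative(baseline_wer: float, corrected_wer: float) -> float:
    """Relative WER improvement in percent (assumed definition of WERR)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - corrected_wer) / baseline_wer


# Hypothetical example: the baseline hypothesis has two word errors,
# the corrected transcript has one.
reference = "show me flights from boston to denver"
baseline_hyp = "show me flight from austin to denver"
corrected_hyp = "show me flights from austin to denver"

baseline_wer = word_error_rate(reference, baseline_hyp)
corrected_wer = word_error_rate(reference, corrected_hyp)
werr = word_error_rate_relative(baseline_wer, corrected_wer)
print(f"baseline WER = {baseline_wer:.2f}%, corrected WER = {corrected_wer:.2f}%, WERR = {werr:.2f}%")
```

Under this assumed definition, halving the number of word errors yields a WERR of 50%, regardless of the absolute WER values involved.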

Paper Structure

This paper contains 23 sections, 7 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Illustration of the proposed generative ASR error correction with a trainable token ($M_{L}$) and the fusion mechanism inside a self-attention layer, described in the Adapter Mechanism section. A detailed model-wise illustration is given in Figure 2.
  • Figure 2: Whispering-LLaMA model overview of the proposed adaptation pipelines, described in the Adapter Mechanism section.
  • Figure 3: Train loss of $\mathcal{WL}_M$ (Row 3) vs $\mathcal{WL}_M$ without initialization (Row 7) on the Entertainment dataset
  • Figure 4: Train loss of $\mathcal{WL}_M$ (Row 3) vs $\mathcal{WL}_M$ without audio representations (Row 6) on the Entertainment dataset
  • Figure 5: Illustration of the Alpaca prompt template used in our proposed framework