LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition

Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo

TL;DR

The paper tackles domain shift in ASR by addressing a key gap in Test-Time Adaptation (TTA): the lack of linguistic supervision. It introduces LI-TTA, which injects corrections from an external instruction-tuned language model into TTA by jointly minimizing the TTA loss and a CTC loss on LM-corrected transcripts, formalized as $L = L_{\text{TTA}} + \lambda_{\text{LI}} L_{\text{CTC}}(\tilde{\mathbf{y}}, \mathbf{y})$. Across diverse domain-shift benchmarks, including non-native speech and noisy conditions, LI-TTA yields lower word error rates and perplexity than prior TTA methods. The approach shows how linguistic feedback can be backpropagated into ASR adaptation without labeled target data, which broadens the practical reach of TTA for robust speech recognition.

Abstract

Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge, wherein the target environment diverges from the original training environment. A prime exemplification is TTA for Automatic Speech Recognition (ASR), which enhances model performance by leveraging output prediction entropy minimization as a self-supervision signal. However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. LI-TTA integrates corrections from an external language model to merge linguistic with acoustic information by minimizing the CTC loss from the correction alongside the standard TTA loss. With extensive experiments, we show that LI-TTA effectively improves the performance of TTA for ASR in various distribution shift situations.
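The combined objective can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes the TTA term is plain mean per-frame entropy (the abstract names entropy minimization, but the exact TTA loss used in the paper may differ), it computes the CTC term with the standard forward algorithm over the LM-corrected token sequence, and the function names and the value of `lambda_li` are hypothetical. It only evaluates the loss; the gradient step on the ASR model is omitted.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over log-domain values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mean_entropy(log_probs):
    """Average per-frame entropy; an assumed stand-in for L_TTA."""
    total = 0.0
    for frame in log_probs:
        total -= sum(math.exp(lp) * lp for lp in frame if lp > NEG_INF)
    return total / len(log_probs)

def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood of `target` via the forward algorithm.

    log_probs: list of T frames, each a list of V log-probabilities.
    target:    list of token ids (here, the LM-corrected transcript).
    """
    # Extended label sequence with blanks interleaved: [b, y1, b, y2, ..., b]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    S = len(ext)

    # Initialise: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping over a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + frame[ext[s]]
        alpha = new

    # Valid paths end in the last label or the trailing blank.
    tail = [alpha[-1]] + ([alpha[-2]] if S > 1 else [])
    return -logsumexp(*tail)

def li_tta_loss(log_probs, corrected_target, lambda_li=0.5):
    """L = L_TTA + lambda_LI * L_CTC, with lambda_li a hypothetical weight."""
    return mean_entropy(log_probs) + lambda_li * ctc_nll(log_probs, corrected_target)

# Toy input: 2 frames, 3 classes (id 0 is the blank), uniform posteriors.
uniform = [[math.log(1 / 3)] * 3 for _ in range(2)]
loss = li_tta_loss(uniform, corrected_target=[1], lambda_li=0.5)
```

For this uniform toy input both terms equal $\ln 3$, so the total is $1.5 \ln 3$. In practice the frame posteriors would come from the ASR model and the target from the language model's correction of the model's own hypothesis.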

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Examples of failure cases with traditional TTA methods in ASR due to the absence of linguistic feedback. (a) Predictions with phonetically similar words persist incorrectly, misaligning with the sentence context. (b) Additionally, some accurate predictions are incorrectly replaced with contextually unfit words of similar phonetic structure.
  • Figure 2: An overview of our proposed Language Informed Test-Time Adaptation (LI-TTA). LI-TTA integrates corrections from an external instruction-tuned language model to merge linguistic with acoustic information by minimizing the CTC loss from the correction alongside the standard TTA loss.
  • Figure 3: The trend of perplexity (PPL) with respect to the TTA adaptation steps. The traditional TTA approach (SGEM) shows only marginal reductions in perplexity due to the absence of linguistic feedback during the adaptation process. Conversely, our proposed LI-TTA demonstrates a significant decrease in perplexity, benefiting from the integration of linguistic insights.
  • Figure 4: Examples of failure cases of the previous TTA method (SGEM), alongside predictions from our proposed LI-TTA with contextually correct transcriptions.
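
The perplexity tracked in Figure 3 can be read through the standard definition: the exponential of the mean negative log-probability per token. The sketch below assumes the per-token natural-log probabilities of a transcript have already been obtained from some language model; which model the paper uses for scoring is not stated here, and the function name is hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of one transcript."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A transcript whose four tokens each have probability 0.25 under the LM.
ppl = perplexity([math.log(0.25)] * 4)
```

Lower values mean the language model finds the transcript more plausible, which is why a falling curve over adaptation steps indicates more linguistically coherent predictions.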