E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models

Jiaheng Dong, Hong Jia, Soumyajit Chatterjee, Abhirup Ghosh, James Bailey, Ting Dang

TL;DR

This work tackles two problems: Speech Foundation Models (SFMs) degrade under real-world acoustic domain shifts, and existing test-time adaptation (TTA) methods are memory-hungry. It introduces E-BATS, a backpropagation-free (BP-free) TTA framework for SFMs that combines three parts: lightweight prompt adaptation applied to the CNN features, a multi-scale loss (entropy, utterance-level alignment, and token-level alignment), and a test-time exponential moving average (T-EMA) that stabilizes learning across utterances. Prompts are searched per utterance with CMA-ES, so adaptation needs only forward passes and no gradients. The loss aligns latent distributions at multiple scales and uses adaptive token confidence to mitigate unreliable pseudo-labels. Across four noisy datasets and two backbone models, E-BATS achieves 4.1%-13.5% accuracy gains over BP-free baselines and 2.0-6.4 times GPU memory savings compared to BP-based methods, and it remains robust across utterance lengths and domain shifts. This approach paves the way for scalable, efficient on-device adaptation of speech foundation models in realistic, resource-constrained environments.

Abstract

Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which differ fundamentally from speech in task formulation, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for forward-pass-based feature alignment, (ii) a multi-scale loss to capture both global (utterance-level) and local (token-level) distribution shifts, and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for developing more efficient adaptation approaches for practical speech processing systems in real-world environments.
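
To make the multi-scale loss more concrete, the following is a minimal NumPy sketch of how an entropy term, an utterance-level alignment term, and a token-level alignment term could be combined into one score. It is not the paper's formulation: the function name `multi_scale_loss`, the squared-distance alignment terms, the source statistics `src_mean` and `src_centroids`, and the weights `w_utt` and `w_tok` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_scale_loss(feats, logits, src_mean, src_centroids, w_utt=1.0, w_tok=1.0):
    """Toy multi-scale loss (all terms are illustrative assumptions).

    feats:         (T, D) adapted frame-level features of one utterance
    logits:        (T, V) token logits for the same frames
    src_mean:      (D,)   hypothetical source-domain feature mean
    src_centroids: (V, D) hypothetical source-domain per-token centroids
    """
    probs = softmax(logits)

    # Entropy term: mean per-frame entropy of the token distribution.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1).mean()

    # Utterance-level (global) term: squared distance between the
    # utterance's mean feature and the source mean.
    utt = np.sum((feats.mean(axis=0) - src_mean) ** 2)

    # Token-level (local) term: distance of each frame to the centroid of
    # its pseudo-labelled token, weighted by prediction confidence so that
    # unreliable pseudo-labels contribute less.
    pseudo = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)
    dists = np.sum((feats - src_centroids[pseudo]) ** 2, axis=-1)
    tok = (conf * dists).sum() / conf.sum()

    return entropy + w_utt * utt + w_tok * tok
```

Because the score needs only features and logits from a forward pass, it can rank candidate prompts without computing any gradients.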

Paper Structure

This paper contains 57 sections, 7 equations, 6 figures, 17 tables, 2 algorithms.

Figures (6)

  • Figure 1: The main difference between (a) Vision Foundation Models and (b) Speech Foundation Models (SFMs) is the sequential pipeline in SFMs, which processes a fixed-length frame of an utterance as input and maps it to a distribution over $|\mathcal{V}|$ token classes.
  • Figure 2: Overall framework of E-BATS. (a) For an utterance $\bm X_t$: (i) Lightweight Prompt Adaptation (LPA): CNN-extracted latent features $\bm{Z}_t$ are adapted in parallel using a set of $\bm{J}$ candidate prompts $\bm s_{t,j}$ generated by CMA-ES, yielding $\bm{J}$ adapted representations. (ii) The $\bm{J}$ adapted representations are evaluated, and their corresponding prompts are ranked using a multi-scale loss (entropy loss, utterance-level and token-level feature alignment). This ranking guides the iterative update of the CMA-ES parameters over $\bm{K}$ iterations until the loss converges, at which point the best prompt is selected for adaptation. The CMA-ES parameters are then smoothed using T-EMA for adaptation to the next utterance. (b) Test-time Exponential Moving Average (T-EMA): T-EMA stabilizes adaptation by smoothing the CMA-ES search trajectory across a stream of utterances, facilitating robust prompt learning.
  • Figure 3: Comparing the source and target latent spaces across different acoustic conditions (same sample size for source and target domain within each condition). Blue and red bars indicate the mean and covariance shifts.
  • Figure 4: Balance between average peak GPU memory usage (bars) and average WER (in $\%$) for different TTA methods across all datasets.
  • Figure 5: Peak GPU memory of TTA on TED as audio duration increases.
  • ...and 1 more figure
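
The loop described in the Figure 2 caption (sample candidate prompts, rank them by a loss, update the search distribution, then smooth it across utterances) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: a simplified diagonal-Gaussian evolution strategy stands in for full CMA-ES, the toy `loss_fn` replaces the multi-scale loss on SFM outputs, and the names `es_prompt_search` and `t_ema` and the decay value 0.9 are hypothetical.

```python
import numpy as np


def es_prompt_search(loss_fn, mean, sigma, n_candidates=8, n_iters=20, seed=0):
    """Gradient-free prompt search with a simplified diagonal evolution
    strategy (a stand-in for CMA-ES, which also adapts a full covariance)."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float).copy()
    sigma = np.asarray(sigma, dtype=float).copy()
    best, best_loss = mean.copy(), loss_fn(mean)
    n_elite = max(1, n_candidates // 2)
    for _ in range(n_iters):
        # Sample J candidate prompts from the current search distribution.
        cands = mean + sigma * rng.standard_normal((n_candidates, mean.size))
        # Score each candidate with forward passes only (no backpropagation).
        losses = np.array([loss_fn(c) for c in cands])
        order = np.argsort(losses)
        elite = cands[order[:n_elite]]
        # Move the search distribution toward the best-ranked candidates.
        mean = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-3  # floor keeps the search from collapsing
        if losses[order[0]] < best_loss:
            best, best_loss = cands[order[0]].copy(), losses[order[0]]
    return best, best_loss, mean, sigma


def t_ema(prev, new, decay=0.9):
    """Exponential moving average used to smooth search state across utterances."""
    return decay * prev + (1.0 - decay) * new


# Toy stream: every "utterance" has features shifted by an unknown offset,
# and an additive prompt should cancel that offset.
dim = 4
shift = np.array([1.0, -1.0, 0.5, 0.5])
state_mean, state_sigma = np.zeros(dim), np.full(dim, 0.5)
for t in range(3):
    loss_fn = lambda s: float(np.sum((shift + s) ** 2))
    best, best_loss, new_mean, _ = es_prompt_search(
        loss_fn, state_mean, state_sigma, seed=t
    )
    # Carry a smoothed search mean over to the next utterance.
    state_mean = t_ema(state_mean, new_mean)
```

In this sketch only the search mean is smoothed and the step size is reset for each utterance. Which CMA-ES parameters are actually carried over, and with what decay, is a design choice this page does not specify.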