E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models
Jiaheng Dong, Hong Jia, Soumyajit Chatterjee, Abhirup Ghosh, James Bailey, Ting Dang
TL;DR
This work tackles the fragility of Speech Foundation Models (SFMs) under real-world acoustic domain shifts and the high memory cost of existing test-time adaptation (TTA) methods. It introduces E-BATS, a backpropagation-free TTA framework for SFMs that combines lightweight prompt adaptation integrated into CNN features, a multi-scale loss (entropy, utterance-level alignment, and token-level alignment), and a test-time exponential moving average (T-EMA) that stabilizes learning across utterances. Optimization relies on CMA-ES to search for per-utterance prompts, which enables adaptation without gradients. The loss aligns latent distributions at multiple scales and uses adaptive token confidence to mitigate unreliable pseudo-labels. Empirical results across four noisy datasets and two backbone models show strong WER gains over BP-free baselines and substantial memory savings compared to BP-based methods, with robust performance across utterance lengths and domain shifts. This approach paves the way for scalable, efficient on-device adaptation of speech foundation models in realistic, resource-constrained environments.
Abstract
Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they were developed predominantly for vision tasks, which differ fundamentally from speech in task formulation, noise characteristics, and model architecture, and therefore transfer poorly. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for forward-pass-based feature alignment, (ii) a multi-scale loss that captures both global (utterance-level) and local (token-level) distribution shifts, and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1% to 13.5% accuracy gains over backpropagation-free baselines and 2.0 to 6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for developing more efficient adaptation approaches for practical speech processing systems in real-world environments.
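
To make the forward-pass-only idea concrete, the following is a minimal illustrative sketch and not the authors' implementation. It makes several assumptions that should be read as such: a simplified isotropic evolution strategy stands in for CMA-ES, only an entropy term is used in place of the full multi-scale loss, the speech model is replaced by a hypothetical fixed linear classifier (`weights`), and the moving average is applied to the prompt and used to warm-start the next utterance. All function and parameter names are invented for illustration.

```python
import numpy as np

def mean_entropy(logits):
    """Average per-frame entropy of softmax(logits); logits has shape (T, C)."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def search_prompt(features, weights, init, rng, pop=16, elite=4, sigma=0.5, iters=20):
    """Gradient-free search for an additive prompt that lowers output entropy.

    Simplified evolution strategy (assumption): sample around a mean, keep the
    best candidates, and move the mean to their average. No covariance
    adaptation is performed, so this is not CMA-ES itself.
    """
    def loss(prompt):
        # Forward pass only: the prompt shifts every frame's feature vector.
        return mean_entropy((features + prompt) @ weights)

    mean = init.copy()
    best, best_loss = init.copy(), loss(init)
    for _ in range(iters):
        candidates = mean + sigma * rng.standard_normal((pop, init.size))
        losses = np.array([loss(c) for c in candidates])
        order = np.argsort(losses)
        if losses[order[0]] < best_loss:
            best, best_loss = candidates[order[0]].copy(), float(losses[order[0]])
        mean = candidates[order[:elite]].mean(axis=0)
    return best, best_loss

def adapt_stream(utterances, weights, dim, decay=0.9, seed=0):
    """Adapt one utterance at a time, smoothing prompts with a moving average."""
    rng = np.random.default_rng(seed)
    ema = np.zeros(dim)
    history = []
    for feats in utterances:
        before = mean_entropy((feats + ema) @ weights)
        prompt, after = search_prompt(feats, weights, ema, rng)
        # Exponential moving average carries information to later utterances.
        ema = decay * ema + (1.0 - decay) * prompt
        history.append((before, after))
    return ema, history

# Toy usage with random stand-ins for frame features and classifier weights.
demo_rng = np.random.default_rng(1)
demo_weights = demo_rng.standard_normal((8, 5))
demo_utterances = [demo_rng.standard_normal((30, 8)) for _ in range(3)]
demo_ema, demo_history = adapt_stream(demo_utterances, demo_weights, dim=8)
```

Because each candidate prompt is scored with a forward pass alone, no activations need to be stored for gradient computation, which is the source of the memory savings that backpropagation-free adaptation targets. A full CMA-ES would additionally adapt the sampling covariance and step size.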
