EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition
Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao
TL;DR
Emo-TTA addresses test-time distribution shift in speech emotion recognition (SER) with a lightweight, training-free method that incrementally estimates the test-time distribution. It couples an audio-language model (ALM; here CLAP) with Gaussian discriminant analysis in the embedding space and updates the class-conditional statistics $\{\mu_y, \Sigma, \pi_y\}$ (class means, a shared covariance, and class priors) one sample at a time via an EM procedure, using ALM-derived priors and entropy-based confidence weighting. The method requires neither gradient-based updates nor access to source data, and it combines the ALM's zero-shot predictions with a generative model term to produce robust per-sample predictions under drift. Empirical results on six out-of-domain SER benchmarks show consistent gains over prior TTA methods and strong performance relative to foundation ALMs, highlighting the practicality and effectiveness of statistical adaptation for SER in real-world settings.
Abstract
Speech emotion recognition (SER) with audio-language models (ALMs) remains vulnerable to distribution shifts at test time, leading to performance degradation in out-of-domain scenarios. Test-time adaptation (TTA) provides a promising solution but often relies on gradient-based updates or prompt tuning, limiting flexibility and practicality. We propose Emo-TTA, a lightweight, training-free adaptation framework that incrementally updates class-conditional statistics via an Expectation-Maximization procedure for explicit test-time distribution estimation, using ALM predictions as priors. Emo-TTA operates on individual test samples without modifying model weights. Experiments on six out-of-domain SER benchmarks show consistent accuracy improvements over prior TTA baselines, demonstrating the effectiveness of statistical adaptation in aligning model predictions with evolving test distributions.
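To make the per-sample update concrete, the following is a minimal sketch of one plausible reading of the procedure described above, not the authors' implementation. It assumes a shared covariance, class means initialized from some external source (for example, ALM text embeddings), a confidence weight of one minus the normalized entropy, and a fusion weight `lam` on the generative term. The class name `OnlineGDA` and the parameters `ridge` and `lam` are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - v.max())
    return e / e.sum()


class OnlineGDA:
    """Hypothetical per-sample EM update of {mu_y, Sigma, pi_y}; no gradients."""

    def __init__(self, init_means, ridge=1e-2, lam=1.0):
        self.mu = np.array(init_means, dtype=float)  # class means, shape (K, D)
        k, d = self.mu.shape
        self.counts = np.ones(k)   # soft pseudo-counts per class (assumed init)
        self.scatter = np.eye(d)   # running weighted scatter for shared Sigma
        self.total = 1.0           # total weight accumulated in `scatter`
        self.ridge = ridge         # assumed diagonal regularizer for Sigma
        self.lam = lam             # assumed weight on the generative term

    def _gda_logits(self, x):
        # Linear discriminant: log pi_k + mu_k^T S^-1 x - 0.5 mu_k^T S^-1 mu_k
        d = self.mu.shape[1]
        sigma = self.scatter / self.total + self.ridge * np.eye(d)
        p = np.linalg.solve(sigma, self.mu.T).T  # rows are S^-1 mu_k
        log_pi = np.log(self.counts / self.counts.sum())
        return log_pi + p @ x - 0.5 * np.sum(p * self.mu, axis=1)

    def step(self, x, zero_shot_logits):
        """Predict for one test embedding, then update the statistics."""
        x = np.asarray(x, dtype=float)
        z = np.asarray(zero_shot_logits, dtype=float)

        # Fuse the ALM zero-shot logits with the generative (GDA) term.
        logits = z + self.lam * self._gda_logits(x)

        # E-step: soft responsibilities, with the ALM acting as a prior.
        gamma = softmax(logits)

        # Entropy-based confidence in [0, 1]; uncertain samples count less.
        entropy = -np.sum(gamma * np.log(gamma + 1e-12))
        w = float(np.clip(1.0 - entropy / np.log(len(gamma)), 0.0, 1.0))

        # M-step: incremental updates of priors, means, and shared scatter.
        diff = x - self.mu  # deviations from the pre-update means, (K, D)
        self.counts += w * gamma
        self.mu += (w * gamma / self.counts)[:, None] * diff
        self.scatter += w * np.einsum("k,ki,kj->ij", gamma, diff, diff)
        self.total += w
        return int(np.argmax(logits)), gamma
```

In this sketch each call to `step` sees a single embedding and its zero-shot logits, so the model weights are never touched and no source data is needed. The scatter update uses deviations from the pre-update means, which is a simplification chosen for brevity.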
