EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition

Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao

TL;DR

Emo-TTA tackles test-time distribution shifts in speech emotion recognition (SER) with a lightweight, training-free approach that incrementally estimates the test-time distribution. It couples an audio–language model (ALM), here CLAP, with Gaussian discriminant analysis in the embedding space and updates the class-conditional statistics $\{\mu_y, \Sigma, \pi_y\}$ one sample at a time via an EM procedure, using ALM-derived priors and entropy-based confidence weighting. The method requires neither gradient-based updates nor access to source data, and it combines the ALM's zero-shot predictions with a generative model term to produce robust per-sample predictions under drift. Empirical results across six out-of-domain SER benchmarks show consistent gains over prior TTA methods and strong performance relative to foundation ALMs, highlighting the practicality and effectiveness of statistical adaptation for SER in real-world settings.
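
The per-sample update described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `OnlineGDA`, the pseudo-count initialization, the fusion weight `lam`, the normalized-entropy confidence `1 - H/log K`, and the predict-then-update order are all assumptions made for illustration. It only shows the general shape of a gradient-free loop that keeps $\{\mu_y, \Sigma, \pi_y\}$ as running statistics with a shared covariance.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_confidence(p):
    """Assumed weighting: 1 - normalized entropy, clipped to [0, 1]."""
    h = -np.sum(p * np.log(p + 1e-12))
    return float(np.clip(1.0 - h / np.log(len(p)), 0.0, 1.0))

class OnlineGDA:
    """Hypothetical online Gaussian discriminant analysis with shared covariance."""

    def __init__(self, prototypes, lam=1.0, eps=1e-3):
        self.mu = np.asarray(prototypes, dtype=float).copy()  # (K, D) class means
        k, d = self.mu.shape
        self.sigma = np.eye(d)            # shared covariance Sigma
        self.pi = np.full(k, 1.0 / k)     # class priors pi_y
        self.counts = np.ones(k)          # one pseudo-count per class (assumption)
        self.total = float(k)
        self.lam, self.eps = lam, eps

    def generative_logits(self, x):
        # log pi_y - 0.5 * Mahalanobis distance; class-independent terms dropped
        prec = np.linalg.inv(self.sigma + self.eps * np.eye(len(x)))
        diff = x - self.mu
        maha = np.einsum("kd,de,ke->k", diff, prec, diff)
        return np.log(self.pi) - 0.5 * maha

    def step(self, x, zero_shot_logits):
        gen = self.generative_logits(x)
        # Fuse zero-shot logits with the generative term for the prediction
        pred = int(np.argmax(zero_shot_logits + self.lam * gen))
        # E-step: responsibilities, with the zero-shot prediction acting as a prior
        resp = softmax(np.log(softmax(zero_shot_logits) + 1e-12) + gen)
        w = entropy_confidence(resp)      # low-confidence samples contribute less
        # M-step: incremental, confidence-weighted updates of the statistics
        self.counts += w * resp
        self.total += w
        self.mu += (w * resp / self.counts)[:, None] * (x - self.mu)
        diff = x - self.mu
        scatter = np.einsum("k,kd,ke->de", resp, diff, diff)
        g = w / self.total
        self.sigma = (1.0 - g) * self.sigma + g * scatter
        self.pi = self.counts / self.counts.sum()
        return pred
```

In use, `prototypes` would be class embeddings from the frozen text encoder and `x` the audio embedding of the incoming test sample; each call to `step` returns a prediction and then refines the statistics, so no model weights are ever modified.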

Abstract

Speech emotion recognition (SER) with audio-language models (ALMs) remains vulnerable to distribution shifts at test time, leading to performance degradation in out-of-domain scenarios. Test-time adaptation (TTA) provides a promising solution but often relies on gradient-based updates or prompt tuning, limiting flexibility and practicality. We propose Emo-TTA, a lightweight, training-free adaptation framework that incrementally updates class-conditional statistics via an Expectation-Maximization procedure for explicit test-time distribution estimation, using ALM predictions as priors. Emo-TTA operates on individual test samples without modifying model weights. Experiments on six out-of-domain SER benchmarks show consistent accuracy improvements over prior TTA baselines, demonstrating the effectiveness of statistical adaptation in aligning model predictions with evolving test distributions.

Paper Structure

This paper contains 15 sections, 11 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of Emo-TTA for test-time adaptation in SER. Given a test audio clip, frozen CLAP encoders extract audio and class prototypes, which initialize an EM-based procedure that continuously updates Gaussian parameters via entropy-weighted confidence. The final prediction fuses CLAP’s zero-shot logits with generative scores for stable, training-free adaptation under distribution shift.