ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge

Soumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy

TL;DR

Abhinaya addresses SER under naturalistic conditions by integrating speech-based, text-based, and speech-text models built on self-supervised and large language models. It evaluates three model families (S1/S2, T1/T2, ST1) and employs loss functions tailored for imbalanced data, with predictions fused via majority voting. On MSP-PODCAST data, the ensemble achieves competitive results, with post-challenge tuning delivering state-of-the-art performance, highlighting the value of multimodal cues and loss-function design. The study provides a practical framework for robust SER in real-world scenarios and informs future multimodal SER systems under challenging data distributions.

Abstract

Speech emotion recognition (SER) in naturalistic settings remains a challenge due to the intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised and speech large language models (SLLM) for speech representations, leverages large language models (LLM) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
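The majority-voting step mentioned in the abstract can be illustrated with a minimal sketch. The model names follow the S1/S2/T1/T2/ST1 labels used above; the emotion labels, the example votes, and the tie-breaking rule (the earliest-listed model wins a tie) are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label in `predictions`.

    Ties are broken in favour of the label that appears first in the
    input order. This tie rule is an assumption, not the paper's.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical per-model decisions for a single utterance.
votes = {"S1": "happy", "S2": "neutral", "T1": "happy", "T2": "sad", "ST1": "happy"}
decision = majority_vote(list(votes.values()))
print(decision)
```

With an odd number of voters and a clear plurality the tie rule never fires; it only matters when two labels receive the same number of votes.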

Paper Structure

This paper contains 24 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Schematic of the different components of the Abhinaya SER system. We use three types of models: speech-only (S1, S2), text-only (T1, T2), and speech-text (ST1). Only T1 is used in a zero-shot setting. The text used by the models consists of ASR transcripts generated by Whisper [radford2023robust].
  • Figure 2: Validation macro F1-score (in %) for different LLMs [dubey2024llama, guo2025deepseek, team2023gemini, achiam2023gpt] evaluated in a zero-shot setting using the ASR transcripts. The LLaMA models considered are the Instruct versions. The baseline performance is also shown.
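
For readers unfamiliar with the metric in Figure 2, macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below computes it in plain Python; the function name, the toy labels, and the choice to score every class seen in either the references or the predictions are illustrative assumptions, not details from the paper.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 in percent over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(per_class) / len(per_class)


# Toy example with two hypothetical classes.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
print(round(score, 2))
```

Because each class contributes equally to the average, a system that ignores minority emotions is penalised even if its overall accuracy is high, which is why the metric suits imbalanced data.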