ABHINAYA -- A System for Speech Emotion Recognition In Naturalistic Conditions Challenge
Soumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy
TL;DR
Abhinaya addresses speech emotion recognition (SER) under naturalistic conditions by integrating speech-based, text-based, and speech-text models built on self-supervised models and large language models. It evaluates three model families (S1/S2, T1/T2, ST1), trains them with loss functions tailored for imbalanced data, and fuses their predictions via majority voting. On MSP-PODCAST data, the ensemble achieves competitive results, and once training was completed after the challenge it reached state-of-the-art performance among published results, highlighting the value of multimodal cues and loss-function design. The study provides a practical framework for robust SER in real-world scenarios and informs future multimodal SER systems under challenging data distributions.
Abstract
Speech emotion recognition (SER) in naturalistic settings remains a challenge due to intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge, which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised models and speech large language models (SLLMs) for speech representations, leverages large language models (LLMs) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Although one model was not fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
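The majority-voting step mentioned above can be illustrated with a minimal sketch. It assumes each model emits one categorical label per utterance and that the per-model prediction lists are aligned by utterance. The function names `majority_vote` and `fuse`, the example labels, and the tie-breaking rule (the earliest-listed model wins a tie) are assumptions made for illustration, not details taken from the Abhinaya system.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among the votes for one utterance.

    Tie-breaking is an assumption of this sketch: among tied labels,
    the one voted for by the earliest-listed model is returned.
    """
    counts = Counter(votes)
    top = max(counts.values())
    for label in votes:
        if counts[label] == top:
            return label


def fuse(per_model_predictions):
    """Fuse predictions from several models, utterance by utterance.

    per_model_predictions holds one list of labels per model,
    all of the same length and in the same utterance order.
    """
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    # Hypothetical outputs of three models on two utterances.
    model_a = ["angry", "happy"]
    model_b = ["angry", "sad"]
    model_c = ["neutral", "sad"]
    print(fuse([model_a, model_b, model_c]))
```

With an odd number of voters and only two competing labels a tie cannot occur, but with more emotion classes ties are possible, so any real system needs an explicit tie-breaking rule; the one used here is only a placeholder.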
