Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model
Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, Carlos Busso, Jagath C. Rajapakse
TL;DR
This work tackles data scarcity in speech emotion recognition (SER) by introducing a mutual-information-regularised generative augmentation framework that combines cross-modal audio-text representations with an InfoGAN-style feature generator. A semantically aligned baseline is first trained with contrastive and mutual-information objectives to connect audio and text; a GAN-based generator is then trained to maximise $I((c,t); \hat{h})$ so that it produces emotion-aware features. The approach extends to multimodal settings, where it generates paired $(\hat{h}, \hat{t})$ samples and augments classifiers with diverse, emotionally coherent data; MI regularisation serves as both a training objective and a measurable quality metric. Evaluations on IEMOCAP, MSP-IMPROV, and MSP-Podcast show consistent improvements over existing baselines in unimodal and multimodal SER, which points to practical value for data-efficient affective computing.
Abstract
The lack of large, well-annotated emotional speech corpora continues to limit the performance and robustness of speech emotion recognition (SER), particularly as models grow more complex and the demand for multimodal systems increases. While generative data augmentation offers a promising solution, existing approaches often produce emotionally inconsistent samples due to oversimplified conditioning on categorical labels. This paper introduces a novel mutual-information-regularised generative framework that combines cross-modal alignment with feature-level synthesis. Building on an InfoGAN-style architecture, our method first learns a semantically aligned audio-text representation space using pre-trained transformers and contrastive objectives. A feature generator is then trained to produce emotion-aware audio features, with mutual information used as a quantitative regulariser to ensure strong dependency between the generated features and their conditioning variables. We extend this approach to multimodal settings, enabling the generation of novel, paired (audio, text) features. Comprehensive evaluation on three benchmark datasets (IEMOCAP, MSP-IMPROV, MSP-Podcast) demonstrates that our framework consistently outperforms existing augmentation methods, achieving state-of-the-art performance with improvements of up to 2.6% in unimodal SER and 3.2% in multimodal emotion recognition. Most importantly, we demonstrate that mutual information functions as both a regulariser and a measurable metric for generative quality, offering a systematic approach to data augmentation in affective computing.
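
To make the regulariser concrete, the following is a sketch of the standard InfoGAN-style variational lower bound, written for a conditioning pair $(c,t)$ and a generated audio feature $\hat{h}$. It assumes that $c$ is the emotion label, $t$ the text representation, $z$ a noise vector, $G$ the feature generator, and $Q$ an auxiliary network that approximates the posterior over $(c,t)$ given $\hat{h}$. The symbols $z$, $G$, and $Q$ are introduced here for illustration only, and the exact estimator used in the paper may differ.

$$
\begin{aligned}
I((c,t); \hat{h}) &= H(c,t) - H(c,t \mid \hat{h}) \\
&\ge H(c,t) + \mathbb{E}_{(c,t) \sim p(c,t),\; z \sim p(z),\; \hat{h} = G(z,c,t)}\big[\log Q(c,t \mid \hat{h})\big].
\end{aligned}
$$

The inequality holds because the gap between the two sides is an expected KL divergence between the true posterior and $Q$, which is non-negative (for continuous $t$, $H$ is read as differential entropy). Since $H(c,t)$ does not depend on $G$ or $Q$, maximising the expectation term raises a lower bound on the mutual information, and the value of that term can be tracked during training as a rough indicator of how strongly the generated features depend on their conditioning variables.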
