Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

Chung-Soo Ahn, Rajib Rana, Sunil Sivadas, Carlos Busso, Jagath C. Rajapakse

TL;DR

This work tackles data scarcity in Speech Emotion Recognition by introducing a mutual-information-regularised generative augmentation framework that jointly leverages cross-modal audio-text representations and an InfoGAN-style feature generator. A semantically aligned baseline is trained with contrastive and mutual-information objectives to connect audio and text, followed by a GAN-based generator that maximises $I((c,t); \hat{h})$ to produce emotion-aware features. The approach extends to multimodal settings, enabling the creation of paired $(\hat{h}, \hat{t})$ samples and augmenting classifiers with diverse, emotionally coherent data; MI regularisation provides both a training objective and a measurable quality metric. Evaluations on IEMOCAP, MSP-IMPROV, and MSP-Podcast show consistent improvements over existing baselines in unimodal and multimodal SER, demonstrating the practical impact for data-efficient affective computing.
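For orientation, the quantity $I((c,t); \hat{h})$ is usually not maximised directly in InfoGAN-style training. The standard approach (Chen et al., 2016) maximises a variational lower bound instead. The sketch below writes that bound with this summary's symbols. The generator $G$, the noise variable $z$, and the auxiliary predictor $Q$ are notation assumed here for illustration, and the paper's exact objective may differ.

$$
I\big((c,t); \hat{h}\big) = H(c,t) - H\big(c,t \mid \hat{h}\big) \;\ge\; \mathbb{E}_{(c,t) \sim p(c,t),\; \hat{h} \sim G(z,c,t)}\big[\log Q(c,t \mid \hat{h})\big] + H(c,t)
$$

Because $H(c,t)$ does not depend on the generator, raising the bound amounts to making $(c,t)$ easy to predict from $\hat{h}$. This is also why the bound can be read as a measure of how strongly the generated features depend on their conditioning variables.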

Abstract

Lack of large, well-annotated emotional speech corpora continues to limit the performance and robustness of speech emotion recognition (SER), particularly as models grow more complex and the demand for multimodal systems increases. While generative data augmentation offers a promising solution, existing approaches often produce emotionally inconsistent samples due to oversimplified conditioning on categorical labels. This paper introduces a novel mutual-information-regularised generative framework that combines cross-modal alignment with feature-level synthesis. Building on an InfoGAN-style architecture, our method first learns a semantically aligned audio-text representation space using pre-trained transformers and contrastive objectives. A feature generator is then trained to produce emotion-aware audio features while employing mutual information as a quantitative regulariser to ensure strong dependency between generated features and their conditioning variables. We extend this approach to multimodal settings, enabling the generation of novel, paired (audio, text) features. Comprehensive evaluation on three benchmark datasets (IEMOCAP, MSP-IMPROV, MSP-Podcast) demonstrates that our framework consistently outperforms existing augmentation methods, achieving state-of-the-art performance with improvements of up to 2.6% in unimodal SER and 3.2% in multimodal emotion recognition. Most importantly, we demonstrate that mutual information functions as both a regulariser and a measurable metric for generative quality, offering a systematic approach to data augmentation in affective computing.
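The contrastive alignment step can be illustrated with a minimal InfoNCE sketch in plain Python. It is only an illustration under stated assumptions: the symmetric audio-to-text and text-to-audio form, the temperature of 0.07, and the names `info_nce`, `_normalise`, and `_log_softmax_at` are hypothetical choices, not details taken from the paper.

```python
import math
import random


def _normalise(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax_at(row, index):
    """Log-softmax of `row` evaluated at position `index` (numerically stable)."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_sum


def info_nce(audio, text, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (audio, text) embeddings.

    Row i of `audio` and row i of `text` are treated as a matched pair;
    every other row in the batch serves as a negative.
    The temperature value is an assumed default, not one from the paper.
    """
    a = [_normalise(v) for v in audio]
    t = [_normalise(v) for v in text]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: pick the matching text among all texts.
    a2t = -sum(_log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-audio direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(_log_softmax_at(cols[j], j) for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy data: "aligned" text embeddings are noisy copies of the audio ones,
# while "random" text embeddings are unrelated to the audio.
random.seed(0)
audio = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned_text = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in audio]
random_text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce(audio, aligned_text)
loss_random = info_nce(audio, random_text)
```

On this toy data the loss for the aligned pairs is lower than for the unrelated pairs, which is the behaviour a semantically aligned audio-text space is meant to produce.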

Paper Structure

This paper contains 28 sections, 12 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Visual summary of the proposed augmentation framework for improving SER, which consists of three stages. First stage (top left): the baseline model is trained with a contrastive loss and an InfoNCE loss. Second stage (top right): an InfoGAN, i.e. a GAN with a mutual information module that predicts the latents $c'$ and $t'$ from the generated $\hat{h}$, is trained; the mutual information module reuses the prediction layers from the first stage. Third stage (bottom): two parallel streams, the SER case (bottom left) and the multimodal SER case (bottom right), train the linear classification module on all possible input combinations by switching between $h$ and $\hat{h}$ and between $t$ and $t'$. The linear classifier layer is fine-tuned with the generated features.
  • Figure 2: t-SNE plots of audio features: real and generated combined (left), generated only (center), and real only (right).