Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition

Upasana Tiwari; Rupayan Chakraborty; Sunil Kumar Kopparapu

TL;DR

The paper tackles robust Speech Emotion Recognition (SER) under real-world noise and cross-corpus variability. It proposes a two-stage framework. First, Emotion-Disentangled Representation Learning (EDRL) extracts class-specific discriminative features while preserving the structure shared across emotion categories. Second, Multiblock Embedding Alignment (MEA) uses multiblock partial least squares (MBPLS) to map these embeddings into a joint latent space aligned with the original input. The resulting embedding pipeline requires no target-domain fine-tuning or data augmentation, yet significantly improves performance in unseen noisy and cross-corpus conditions, as demonstrated on IEMOCAP and in cross-dataset evaluations with arousal/valence labeling. The work highlights the importance of disentangling emotion-specific and shared features and of aligning embeddings to preserve both intra-class discriminability and inter-class coherence, offering a practical path toward more reliable real-world SER systems.

Abstract

The effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are then used to train an emotion classifier on clean samples from publicly available datasets, and the classifier is evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
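
The alignment step described above can be illustrated with a minimal sketch. It is not the authors' MEA implementation: it swaps in a simplified PLS-SVD projection (no deflation) on block-scaled, concatenated embeddings, which only approximates multiblock PLS. The function names `block_scale` and `fit_joint_projection`, the choice of one embedding block per emotion class, and all array shapes are assumptions made for illustration.

```python
import numpy as np


def block_scale(block):
    """Standardize columns, then divide by sqrt(n_columns) so that
    each block contributes equal total variance when concatenated."""
    centered = block - block.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return centered / std / np.sqrt(block.shape[1])


def fit_joint_projection(blocks, target, n_components=2):
    """Project concatenated embedding blocks onto directions that
    maximize covariance with `target` (assumed here to be features
    of the original speech input).

    Simplification: weights are the leading left singular vectors of
    the cross-covariance matrix X^T Y (PLS-SVD), not full MBPLS.
    """
    X = np.hstack([block_scale(b) for b in blocks])
    Y = target - target.mean(axis=0)
    # Leading singular vectors of X^T Y give the weight directions w
    # for which the covariance between X w and Y c is largest.
    U, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
    W = U[:, :n_components]
    return X @ W, W


# Hypothetical shapes: 4 emotion-specific blocks of 8-dim embeddings,
# 12-dim input features, 50 utterances.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(50, 8)) for _ in range(4)]
target = rng.normal(size=(50, 12))
scores, W = fit_joint_projection(blocks, target, n_components=3)
```

In this sketch, `scores` would play the role of the aligned embeddings passed to a downstream emotion classifier. The block scaling keeps a high-dimensional block from dominating the joint subspace, which is the usual motivation for multiblock scaling.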
