Mouth Articulation-Based Anchoring for Improved Cross-Corpus Speech Emotion Recognition

Shreya G. Upadhyay, Ali N. Salman, Carlos Busso, Chi-Chun Lee

TL;DR

This paper addresses cross-corpus speech emotion recognition (SER) by shifting focus from noisy acoustic feature alignment to stable mouth articulatory gestures (AG) as anchors. It introduces AG-anchored cross-corpus SER (AG-CC), which clusters mouth gestures from multimodal corpora and uses a soft-weighted triplet loss to align acoustic embeddings within common AG clusters, combining this with standard SER loss as $L_{Total} = L_{ER} + \gamma L_{AG}$. Experiments on CREMA-D and MSP-IMPROV show that AG-CC improves cross-corpus transfer over phoneme-anchored and layer-anchored baselines and reveals meaningful associations between AG patterns and acoustic features. The approach offers a robust, linguistically grounded constraint for emotion transfer and suggests potential for cross-lingual generalization and more robust SER in varied recording conditions.
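The combined objective $L_{Total} = L_{ER} + \gamma L_{AG}$ can be illustrated with a minimal NumPy sketch. The summary does not specify how the soft weights are computed, so this sketch assumes one weight in $[0, 1]$ per triplet (for example, a cluster-membership confidence) and a standard Euclidean triplet margin loss. The function names, the margin, and the default value of `gamma` are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_weighted_triplet_loss(anchor, positive, negative, weights, margin=1.0):
    """Triplet margin loss with a soft weight per triplet (assumed form).

    anchor, positive, negative: arrays of shape (n_triplets, dim), where the
    positive shares the anchor's AG cluster and the negative does not.
    weights: array of shape (n_triplets,), hypothetical confidences in [0, 1].
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    per_triplet = np.maximum(0.0, d_pos - d_neg + margin)
    # Weighted mean, so low-confidence triplets contribute less.
    return float(np.sum(weights * per_triplet) / (np.sum(weights) + 1e-8))

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, standing in for the SER loss L_ER."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def total_loss(logits, labels, anchor, positive, negative, weights, gamma=0.5):
    """L_Total = L_ER + gamma * L_AG, with gamma=0.5 as an arbitrary example."""
    l_er = cross_entropy(logits, labels)
    l_ag = soft_weighted_triplet_loss(anchor, positive, negative, weights)
    return l_er + gamma * l_ag
```

In this sketch, a triplet whose positive already lies closer to the anchor than the negative by at least the margin contributes zero to $L_{AG}$, so only poorly aligned embeddings within a common AG cluster are penalized.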

Abstract

Cross-corpus speech emotion recognition (SER) plays a vital role in numerous practical applications. Traditional approaches to cross-corpus emotion transfer often concentrate on adapting acoustic features to align with different corpora, domains, or labels. However, acoustic features are inherently variable and error-prone due to factors such as speaker differences, domain shifts, and recording conditions. To address these challenges, this study adopts a novel contrastive approach that focuses on emotion-specific articulatory gestures as the core elements for analysis. By shifting the emphasis to the more stable and consistent articulatory gestures, we aim to enhance emotion transfer learning in SER tasks. Using the CREMA-D and MSP-IMPROV corpora as benchmarks, our study reveals valuable insights into the commonality and reliability of these articulatory gestures. The findings highlight the potential of mouth articulatory gestures as a better constraint for improving emotion recognition across different settings or domains.

Paper Structure (10 sections, 4 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Clustered AG patterns for /A/ and /i/ from both corpora, which fall into two AG clusters. The mean pattern at each frame is shown, with the standard deviation computed over 50 samples for each vowel.
  • Figure 2: Visualization of the association between AG clusters and acoustic clusters across different emotions.
  • Figure 3: Proposed mouth articulation-based anchoring architecture for cross-corpus SER.