MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition

Jarod Duret, Mickael Rouvier, Yannick Estève

TL;DR

This paper addresses robust eight-class emotion recognition from naturalistic MSP-Podcast data. It proposes a two-level ensemble of self-supervised pretrained encoders for speech and text, with five diverse sub-systems whose outputs are fused by an SVM at the score level. Key contributions include the exploration of SSL-based multimodal representations, a Jeffreys loss variant, a dual-encoder setup, and data-augmentation strategies with Whisper-generated transcripts and consensus re-labeling. The results show the fused system achieving a development Macro-F1 of about 0.35, demonstrating improved robustness on naturalistic speech and providing a replication-ready pipeline for multimodal SER.

Abstract

In this work, we detail our submission to the 2024 edition of the MSP-Podcast Speech Emotion Recognition (SER) Challenge. The challenge is divided into two distinct tasks: Categorical Emotion Recognition and Emotional Attribute Prediction. We concentrated our efforts on Task 1, which involves the categorical classification of eight emotional states using data from the MSP-Podcast dataset. Our approach employs an ensemble of models, each trained independently and then fused at the score level using a Support Vector Machine (SVM) classifier. The models were trained using various strategies, including Self-Supervised Learning (SSL) fine-tuning across different modalities: speech alone, text alone, and a combined speech and text approach. This joint training methodology aims to enhance the system's ability to accurately classify emotional states. The resulting system obtained a Macro-F1 of 0.35 on the development set.
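For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro-averaged F1 score is computed: the per-class F1 values are averaged with equal weight, so rare emotion classes count as much as frequent ones. The function name `macro_f1`, the toy labels, and the convention of scoring a class with no predictions and no references as 0 are assumptions made for illustration; this is not the challenge's official scoring script.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # reference class gets a false negative
    per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Assumption: a class that never appears scores 0.0
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example with three hypothetical classes (the challenge uses eight).
y_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "happy", "sad", "angry"]
score = macro_f1(y_true, y_pred, ["angry", "happy", "sad"])
print(round(score, 4))
```

Because every class contributes equally, a system that ignores minority emotions is penalized more heavily than it would be under plain accuracy, which is relevant for naturalistic data where class frequencies can be uneven.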

Paper Structure (23 sections, 1 equation, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Illustration of our speech emotion recognition system.
  • Figure 2: Illustration of the dual speech encoder emotion recognition system.
  • Figure 3: Illustration of our joint speech and text emotion recognition system.
  • Figure 4: The confusion matrix provided by the fusion system.