Adapting WavLM for Speech Emotion Recognition

Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin

TL;DR

This work examines how fine-tuning WavLM Large for speech emotion recognition (SER) benefits from two ways of accumulating information: pooling over the time dimension and conditioning on auxiliary data such as gender and text. By comparing STD pooling with attention pooling, and by injecting gender and text embeddings, the authors show that STD pooling combined with gender conditioning improves performance, whereas text conditioning can degrade it. They also show that fusing multiple model variants via constrained optimization yields the strongest overall result, with a higher F1-macro than any single model. The findings offer practical guidance for deploying SSL-based SER systems with auxiliary attributes and model ensembling in challenging, imbalanced settings such as the MSP Podcast Corpus and the Odyssey Challenge 2024.
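The two time-dimension pooling strategies named above can be illustrated with a minimal sketch. It assumes that "STD pooling" means concatenating the per-dimension mean and standard deviation over time, and that attention pooling is a softmax-weighted average of frames. Neither definition is spelled out in this summary, so the function names `mean_std_pool` and `attention_pool`, and the choice to pass attention scores in directly instead of computing them with a learned layer, are hypothetical.

```python
import math


def mean_std_pool(frames):
    """Pool T frame vectors of size D into one vector of size 2*D: [mean, std].

    Assumption: this is one common reading of "STD pooling"; the paper's
    exact formulation is not given in this summary.
    """
    t = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / t for d in range(dim)]
    # Population standard deviation along the time axis.
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / t)
        for d in range(dim)
    ]
    return mean + std


def attention_pool(frames, scores):
    """Pool T frame vectors into one vector of size D using softmax(scores).

    Assumption: in a real model the scores would come from a learned layer
    applied to each frame; here they are plain inputs for illustration.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


# Three frames with two features each (toy values, not real WavLM outputs).
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 6.0]]
print(mean_std_pool(frames))
print(attention_pool(frames, [0.0, 0.0, 0.0]))
```

With equal scores, attention pooling reduces to a plain mean over time, while mean-and-std pooling additionally keeps a measure of how much each feature varies across the utterance.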

Abstract

Recently, the use of self-supervised learning (SSL) speech models for downstream tasks has drawn considerable attention. While large pre-trained models commonly outperform smaller models trained from scratch, the optimal fine-tuning strategy remains an open question. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP Podcast Corpus. More specifically, we perform a series of experiments that focus on using gender and semantic information from utterances. We then summarize our findings and describe the final model we submitted to the Speech Emotion Recognition Challenge 2024.

Paper Structure

This paper contains 24 sections, 10 equations, 1 figure, and 7 tables.

Figures (1)

  • Figure 1: Label distribution in the training and development sets. Note that the dataset is highly imbalanced, and that the share of samples belonging to each class differs between the training and development sets.
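
The imbalance noted in Figure 1 is one reason a macro-averaged metric such as the F1-macro mentioned in the TL;DR is informative: every class counts equally, however rare it is. The sketch below is a plain-Python illustration under stated assumptions, not the challenge's official scoring code. The helper name `macro_f1` and the labels "neutral" and "angry" are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy, illustrative labels: a classifier that always predicts the majority class.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 8 of 10 predictions are correct
print(macro_f1(y_true, y_pred))  # much lower, because "angry" is never found
```

In this toy case accuracy looks acceptable while the macro-averaged score is pulled down by the missed minority class, which is the behavior that makes the metric suited to imbalanced label distributions.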