Adapting WavLM for Speech Emotion Recognition
Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin
TL;DR
This work probes how fine-tuning WavLM Large for speech emotion recognition benefits from information accumulation strategies, notably time-dimension pooling and auxiliary data like gender and text. By comparing STD and attention pooling, and by injecting gender and text embeddings, the authors demonstrate that STD pooling plus gender conditioning improves performance, whereas text conditioning can degrade results. They also show that fusing multiple model variants via constrained optimization yields the strongest overall gains, achieving a higher F1-macro than any single model. The findings offer practical guidance for deploying SSL-based SER systems with auxiliary attributes and model ensembling in challenging, imbalanced settings such as the MSP Podcast Corpus and Odyssey Challenge 2024 workflows.
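As a rough illustration of the two time-dimension pooling strategies named above, the following minimal sketch shows standard-deviation (STD) pooling and attention pooling over a sequence of frame-level features, followed by a simple gender-conditioning step. It is written in plain Python for readability and is not the authors' implementation. The function names, the per-frame `scores` argument, and the use of concatenation for injecting the gender embedding are all assumptions made for this example.

```python
import math


def std_pool(frames):
    """Per-feature (population) standard deviation over the time axis.

    `frames` is a list of T frame vectors, each a list of D floats.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]


def attention_pool(frames, scores):
    """Softmax-weighted mean over the time axis.

    `scores` holds one logit per frame. In a trained model these would come
    from a learned scoring layer; here they are passed in directly
    (an assumption made for this sketch).
    """
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def condition_on_gender(pooled, gender_embedding):
    """Append a gender embedding to the pooled utterance vector.

    Concatenation is an assumed fusion choice for illustration only.
    """
    return list(pooled) + list(gender_embedding)
```

Either pooled vector has a fixed size regardless of utterance length, so it could be passed to a classification head after conditioning.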
Abstract
Recently, the use of self-supervised learning (SSL) speech models for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions about the optimal fine-tuning strategies remain open. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on the use of gender and semantic information from utterances. We then summarize our findings and describe the final model we submitted to the Speech Emotion Recognition Challenge 2024.
