
1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

TL;DR

This work tackles the problem of imbalanced emotion class distributions in speech emotion recognition by combining a focal loss with prior-based class weights within a seven-model multimodal ensemble. Each model processes frame-level audio and token-level text, derived from ASR transcripts corrected by a dedicated error-correction system, and the models are fused via majority voting to produce robust final predictions. The approach demonstrates that prioritizing minority classes through targeted loss weighting, when complemented by focal loss, yields improved macro-level performance despite some trade-offs on major classes, achieving a Macro-F1 of 35.69% and an accuracy of 37.32% on Odyssey 2024 Task-1, ranking first among 68 submissions. The study highlights the value of model diversity and textual information from ASR in SER, offering a practical strategy for handling class imbalance in real-world, spontaneous-speech datasets.

Abstract

Speech emotion recognition on natural emotional speech is a challenging classification task, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, so these classes are sometimes ignored or frequently misclassified. Previous work has used class-weighted loss for training, but problems remain, as it sometimes causes over-fitting on minor classes or under-fitting on major classes. This paper presents the system developed by a multi-site team for participation in the Odyssey 2024 Emotion Recognition Challenge Task-1. The challenge data has the aforementioned properties, so the presented systems aim to tackle these issues by introducing focal loss in optimisation alongside class-weighted loss. Specifically, the focal loss is further weighted by prior-based class weights. Experimental results show that combining these two approaches brings better overall performance, at the cost of some performance on major classes. The system further employs a majority voting strategy to combine the outputs of an ensemble of 7 models. The models are trained independently, using different acoustic features and loss functions, with the aim of giving them different properties on different data. Hence these models show different performance preferences on major classes and minor classes. The ensemble system output obtained the best performance in the challenge, ranking first among 68 submissions. It also outperformed all single models in our set. On the Odyssey 2024 Emotion Recognition Challenge Task-1 data the system obtained a Macro-F1 score of 35.69% and an accuracy of 37.32%.
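
The two ingredients named in the abstract, a focal loss scaled by prior-based class weights and majority voting over an ensemble, can be illustrated with a minimal sketch. This is not the authors' implementation. The inverse-prior form of the weights, the focusing parameter `gamma=2.0`, the tie-breaking rule, and all function names are assumptions made here for illustration only.

```python
import math
from collections import Counter


def prior_based_weights(labels, num_classes):
    """Assumed form: weight_c proportional to 1 / prior_c, scaled to average 1."""
    counts = Counter(labels)
    n = len(labels)
    # prior_c = counts[c] / n, so 1 / prior_c = n / counts[c]
    raw = [n / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    scale = num_classes / sum(raw)
    return [w * scale for w in raw]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_focal_loss(logits, target, weights, gamma=2.0):
    """Focal loss for one example, scaled by the weight of its true class.

    loss = -w_t * (1 - p_t)^gamma * log(p_t); gamma = 0 recovers
    class-weighted cross-entropy.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def majority_vote(predictions):
    """Return the most frequent label among the per-model predictions.

    Assumed tie-break: the tied label predicted by the earliest model wins.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:
        if counts[p] == best:
            return p


if __name__ == "__main__":
    train_labels = [0, 0, 0, 1]  # toy imbalanced label set
    w = prior_based_weights(train_labels, num_classes=2)
    print("class weights:", w)
    print("loss on a minority example:", weighted_focal_loss([0.0, 0.0], 1, w))
    print("ensemble decision:", majority_vote([2, 1, 2, 0, 2, 1, 1]))
```

The factor `(1 - p_t) ** gamma` shrinks the loss on examples the model already classifies confidently, while the class weight raises the contribution of rare classes. The abstract presents this combination as a way to limit the over-fitting and under-fitting seen with class-weighted loss alone.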

Paper Structure (21 sections, 5 equations, 1 figure, 6 tables)