Towards Unified Music Emotion Recognition across Dimensional and Categorical Models
Jaeyong Kang, Dorien Herremans
TL;DR
The paper tackles the challenge of heterogeneous emotion labels in Music Emotion Recognition (MER) by proposing a unified multitask learning framework that trains on both categorical and dimensional labels across multiple datasets. It integrates MERT embeddings with high-level harmonic features (chord progressions and key) and employs knowledge distillation (KD) to transfer knowledge from dataset-specific teachers to a single student model. Empirical results on MTG-Jamendo, DEAM, PMEmo, and EmoMusic show that the combination of MERT, chord/key features, multitask learning, and KD yields state-of-the-art performance on MTG-Jamendo (PR-AUC 0.1543, ROC-AUC 0.7810) and improved valence-arousal (VA) predictions. The approach enables training across datasets with different label types and provides an open-source implementation for broader adoption in MER and related affective computing tasks.
Abstract
One of the most significant challenges in Music Emotion Recognition (MER) comes from the fact that emotion labels can be heterogeneous across datasets with regard to the emotion representation, including categorical (e.g., happy, sad) versus dimensional labels (e.g., valence-arousal). In this paper, we present a unified multitask learning framework that combines these two types of labels and can therefore be trained on multiple datasets. This framework uses an effective input representation that combines musical features (i.e., key and chords) and MERT embeddings. Moreover, knowledge distillation is employed to transfer the knowledge of teacher models trained on individual datasets to a student model, enhancing its ability to generalize across multiple tasks. To validate our proposed framework, we conducted extensive experiments on a variety of datasets, including MTG-Jamendo, DEAM, PMEmo, and EmoMusic. According to our experimental results, the inclusion of musical features, multitask learning, and knowledge distillation significantly enhances performance. In particular, our model outperforms the state-of-the-art models, including the best-performing model from the MediaEval 2021 competition on the MTG-Jamendo dataset. Our work makes a significant contribution to MER by allowing the combination of categorical and dimensional emotion labels in one unified framework, thus enabling training across datasets.
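To make the training setup described in the abstract more concrete, the following is a minimal NumPy sketch of how one objective might serve both label types. It is an illustration under stated assumptions, not the authors' implementation. The function names, the dictionary keys, the choice of binary cross-entropy for categorical tags and mean squared error for valence-arousal, the tempered-sigmoid form of the distillation term, and the `alpha` and `temperature` values are all hypothetical. The abstract does not specify any of them.

```python
import numpy as np


def fuse_features(mert_embedding, harmonic_features):
    """Concatenate a MERT embedding with chord/key features (assumed fusion scheme)."""
    return np.concatenate([mert_embedding, harmonic_features], axis=-1)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy, averaged over all entries."""
    per_entry = (np.maximum(logits, 0.0) - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_entry))


def mse(pred, target):
    """Mean squared error, used here for valence-arousal regression."""
    return float(np.mean((pred - target) ** 2))


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed distillation term: the student matches the teacher's
    temperature-softened per-tag probabilities."""
    soft_targets = sigmoid(teacher_logits / temperature)
    return bce_with_logits(student_logits / temperature, soft_targets)


def multitask_loss(batch, outputs, teacher_logits=None, alpha=0.5):
    """Pick the loss by the label type of the dataset the batch came from.

    batch["label_type"] is "categorical" or "dimensional" (hypothetical keys).
    """
    if batch["label_type"] == "categorical":
        loss = bce_with_logits(outputs["tag_logits"], batch["labels"])
        if teacher_logits is not None:
            # Blend the hard-label loss with the teacher-guided loss.
            distill = kd_loss(outputs["tag_logits"], teacher_logits)
            loss = (1.0 - alpha) * loss + alpha * distill
        return loss
    if batch["label_type"] == "dimensional":
        return mse(outputs["va"], batch["labels"])
    raise ValueError(f"unknown label type: {batch['label_type']}")
```

In this sketch a single student network would expose both a tag head and a valence-arousal head, and each batch contributes only the loss that matches its dataset's label type. That is one simple way to train across datasets with heterogeneous labels.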
