Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning
Dengming Zhang, Weitao You, Ziheng Liu, Lingyun Sun, Pei Chen
TL;DR
This work tackles Dynamic Music Emotion Recognition (DMER) and introduces Personalized DMER (PDMER) to account for individual emotion perception. It proposes DSAML, which combines a dual-scale feature extractor and a dual-scale attention transformer, enhanced by an Imagebind Adapter, with a BiLSTM-based sequence predictor to model short- and long-range dependencies in valence ($V$) and arousal ($A$). For PDMER, it adopts Model-Agnostic Meta-Learning (MAML) with a novel annotator-based task construction strategy, enabling rapid personalization from a single annotated sample per user. Objective and subjective experiments demonstrate state-of-the-art performance on traditional DMER and closer alignment with individual perception in PDMER, with ablations confirming the necessity of local/global attention, the diagonal attention loss, the Imagebind adapter, and the annotator-based meta-learning setup. The approach is practically useful for personalized music emotion applications because it adapts efficiently to individual perceptual differences.
Abstract
Dynamic Music Emotion Recognition (DMER) aims to predict the emotion at different moments in a piece of music and plays a crucial role in music information retrieval. Existing DMER methods struggle to capture long-term dependencies in sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though each listener perceives emotion in their own way. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short- and long-term dependencies using a dual-scale attention transformer, improving performance on traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotator: all samples in a task are annotated by the same annotator, ensuring consistent perception within the task. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions from just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method achieves state-of-the-art performance in both traditional DMER and PDMER.
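The annotator-based task construction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`clip_id`, `annotator_id`, `va`), the helper `build_annotator_tasks`, and the support/query split are all assumptions made for illustration, following the usual meta-learning convention. The one-sample support set mirrors the "one personalized annotation sample" setting.

```python
import random
from collections import defaultdict


def build_annotator_tasks(samples, k_support=1, seed=0):
    """Group samples by annotator; each annotator with enough samples forms one task.

    Hypothetical helper: the source only states that tasks are divided by
    annotator, not how support and query sets are drawn.
    """
    rng = random.Random(seed)
    by_annotator = defaultdict(list)
    for sample in samples:
        by_annotator[sample["annotator_id"]].append(sample)

    tasks = []
    for annotator_id, group in sorted(by_annotator.items()):
        # An annotator needs at least one sample beyond the support set
        # so that the task has a non-empty query set.
        if len(group) <= k_support:
            continue
        shuffled = group[:]
        rng.shuffle(shuffled)
        tasks.append({
            "annotator_id": annotator_id,
            "support": shuffled[:k_support],  # used for inner-loop adaptation
            "query": shuffled[k_support:],    # used for the outer-loop loss
        })
    return tasks


# Toy data (made up): each sample is one clip annotated by one annotator,
# with a short sequence of (valence, arousal) pairs.
samples = [
    {"clip_id": "c1", "annotator_id": "a1", "va": [(0.1, 0.2), (0.2, 0.3)]},
    {"clip_id": "c2", "annotator_id": "a1", "va": [(0.4, 0.1), (0.5, 0.0)]},
    {"clip_id": "c3", "annotator_id": "a1", "va": [(-0.2, 0.6), (-0.1, 0.5)]},
    {"clip_id": "c1", "annotator_id": "a2", "va": [(-0.3, 0.2), (-0.2, 0.2)]},
    {"clip_id": "c2", "annotator_id": "a2", "va": [(0.0, -0.1), (0.1, -0.2)]},
    {"clip_id": "c3", "annotator_id": "a3", "va": [(0.3, 0.3), (0.3, 0.4)]},
]

tasks = build_annotator_tasks(samples, k_support=1)
for task in tasks:
    print(task["annotator_id"], len(task["support"]), len(task["query"]))
```

In this sketch, annotator `a3` contributes only one sample and is skipped, since a one-sample task would leave nothing to evaluate after adaptation. Whether the original method filters annotators this way is not stated in the source.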
