Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning

Dengming Zhang; Weitao You; Ziheng Liu; Lingyun Sun; Pei Chen

TL;DR

This work tackles Dynamic Music Emotion Recognition (DMER) and introduces Personalized DMER (PDMER) to account for individual differences in emotion perception. It proposes DSAML, which combines a dual-scale feature extractor and a dual-scale attention transformer, enhanced by an Imagebind Adapter, with a BiLSTM-based sequence predictor to model short- and long-range dependencies in valence ($V$) and arousal ($A$). For PDMER, it adopts Model-Agnostic Meta-Learning (MAML) with a novel annotator-based task construction strategy, enabling rapid personalization from a single annotated sample per user. Objective and subjective experiments demonstrate state-of-the-art performance on traditional DMER and better alignment with individual perception in PDMER, with ablations confirming the necessity of local/global attention, the diagonal attention loss, the Imagebind Adapter, and the annotator-based meta-learning setup. The approach is practically useful for personalized music emotion applications because it adapts efficiently to individual perceptual differences.

Abstract

Dynamic Music Emotion Recognition (DMER) aims to predict the emotion of different moments in music, playing a crucial role in music information retrieval. Existing DMER methods struggle to capture long-term dependencies when dealing with sequence data, which limits their performance. Furthermore, these methods often overlook the influence of individual differences on emotion perception, even though every listener perceives emotion differently in the real world. Motivated by these issues, we explore more effective sequence processing methods and introduce the Personalized DMER (PDMER) problem, which requires models to predict emotions that align with personalized perception. Specifically, we propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method. This method fuses features from a dual-scale feature extractor and captures both short- and long-term dependencies using a dual-scale attention transformer, improving performance on traditional DMER. To achieve PDMER, we design a novel task construction strategy that divides tasks by annotators. Samples in a task are annotated by the same annotator, ensuring consistent perception. Leveraging this strategy alongside meta-learning, DSAML can predict personalized perception of emotions with just one personalized annotation sample. Our objective and subjective experiments demonstrate that our method can achieve state-of-the-art performance in both traditional DMER and PDMER.
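The annotator-based task construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`clip_id`, `annotator_id`, `va`), the function name `build_annotator_tasks`, and the toy values are assumptions made for illustration. The sketch only shows the grouping idea, in which every task holds samples labeled by one annotator and a single sample is held out as the support set used for personalization.

```python
import random
from collections import defaultdict


def build_annotator_tasks(samples, n_support=1, seed=0):
    """Group samples by annotator and split each group into support/query sets.

    Hypothetical helper: each sample is a dict with an 'annotator_id' key.
    Annotators with too few samples to form a query set are skipped.
    """
    rng = random.Random(seed)

    # One bucket per annotator, so a task never mixes different listeners.
    by_annotator = defaultdict(list)
    for sample in samples:
        by_annotator[sample["annotator_id"]].append(sample)

    tasks = {}
    for annotator_id, group in by_annotator.items():
        if len(group) <= n_support:
            continue  # need at least one sample left over for the query set
        shuffled = group[:]
        rng.shuffle(shuffled)
        tasks[annotator_id] = {
            "support": shuffled[:n_support],  # used for the adaptation step
            "query": shuffled[n_support:],    # used to evaluate the adapted model
        }
    return tasks


# Toy data: each 'va' entry is a short (valence, arousal) curve (made-up values).
samples = [
    {"clip_id": "c1", "annotator_id": "a1", "va": [(0.1, 0.3), (0.2, 0.4)]},
    {"clip_id": "c2", "annotator_id": "a1", "va": [(0.0, 0.1), (0.1, 0.2)]},
    {"clip_id": "c3", "annotator_id": "a1", "va": [(0.5, 0.6), (0.4, 0.6)]},
    {"clip_id": "c1", "annotator_id": "a2", "va": [(-0.2, 0.3), (-0.1, 0.5)]},
    {"clip_id": "c4", "annotator_id": "a2", "va": [(0.3, -0.1), (0.3, 0.0)]},
    {"clip_id": "c2", "annotator_id": "a3", "va": [(0.2, 0.2), (0.2, 0.3)]},
]

tasks = build_annotator_tasks(samples, n_support=1)
```

With `n_support=1`, each task mirrors the one-sample personalization setting: a meta-learner such as MAML would adapt on the single support sample and be evaluated on the query samples from the same annotator.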

Paper Structure (29 sections, 6 equations, 6 figures, 4 tables)

Introduction
Related Work
Dynamic Music Emotion Recognition
Personalized Music Emotion Recognition
Meta-learning
Methods
Problem Formulation
Model Architecture
Input Preprocessor
Dual-Scale Feature Extractor
Dual-Scale Attention Transformer
Sequence Predictor
Personalized Strategy
Training & Inference Process
Implementation Details
...and 14 more sections

Figures (6)

Figure 1: The differences between traditional DMER and PDMER. All charts represent the emotion valence/arousal (V/A) curve of music, where the x-axis represents time and the y-axis represents V/A values.
Figure 2: The architecture of the DSAML model.
Figure 3: The architecture of the Imagebind Adapter.
Figure 4: Attention Map.
Figure 5: Example of personalized annotations for the same song with three different annotators in the DEAM dataset.
...and 1 more figures
