A Survey on Multimodal Music Emotion Recognition

Rashini Liyanarachchi, Aditya Joshi, Erik Meijering

TL;DR

This survey addresses multimodal music emotion recognition (MMER) by detailing a four-stage framework (data selection, feature extraction, feature processing, emotion prediction) and surveying the state of the art across audio, lyrics, visuals, symbolic, physiological, textual, and contextual modalities. It covers feature types (low- and mid-level audio, lyric embeddings, visual cues, MIDI, EEG/physiological signals, metadata), three fusion paradigms (feature-level, model-level, cross-modal), and the progression from SLEA to CEA datasets, highlighting the rise of deep learning and cross-modal processing. The paper identifies core gaps (limited multimodal datasets, subjective emotion labeling, and lack of standardized benchmarks) and proposes directions including unsupervised/transfer learning, real-time MMER, and richer modality integration. Its analysis underscores MMER's potential for improving music recommendation, therapeutic tools, and human–computer interaction through more robust, scalable, and interpretable models.

Abstract

Multimodal music emotion recognition (MMER) is an emerging discipline in music information retrieval that has experienced a surge in interest in recent years. This survey provides a comprehensive overview of the current state of the art in MMER. Discussing the different approaches and techniques used in this field, the paper introduces a four-stage MMER framework, including multimodal data selection, feature extraction, feature processing, and final emotion prediction. The survey further reveals significant advancements in deep learning methods and the increasing importance of feature fusion techniques. Despite these advancements, challenges such as the need for large annotated datasets, datasets with more modalities, and real-time processing capabilities remain. This paper also contributes to the field by identifying critical gaps in current research and suggesting potential directions for future research. The gaps underscore the importance of developing robust, scalable, and interpretable models for MMER, with implications for applications in music recommendation systems, therapeutic tools, and entertainment.
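
The four-stage framework named in the abstract can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions, not the survey's method: the function names, the two toy features (mean signal energy, word count), and the linear predictor are all hypothetical, and feature processing is shown as simple feature-level fusion by concatenation.

```python
from typing import Any, Dict, List


def select_modalities(track: Dict[str, Any], wanted: List[str]) -> Dict[str, Any]:
    """Stage 1 (data selection): keep only the requested modalities that exist."""
    return {name: track[name] for name in wanted if name in track}


def extract_features(data: Dict[str, Any]) -> Dict[str, List[float]]:
    """Stage 2 (feature extraction): toy, hypothetical features per modality."""
    features: Dict[str, List[float]] = {}
    if "audio" in data:
        samples = data["audio"]
        # Mean energy of the raw samples stands in for real audio features.
        features["audio"] = [sum(s * s for s in samples) / len(samples)]
    if "lyrics" in data:
        # Word count stands in for real lyric embeddings.
        features["lyrics"] = [float(len(data["lyrics"].split()))]
    return features


def fuse_features(features: Dict[str, List[float]]) -> List[float]:
    """Stage 3 (feature processing): feature-level fusion by concatenation,
    using a fixed (sorted) modality order so the layout is reproducible."""
    fused: List[float] = []
    for name in sorted(features):
        fused.extend(features[name])
    return fused


def predict_emotion(fused: List[float], weights: List[float], bias: float) -> float:
    """Stage 4 (emotion prediction): placeholder linear model returning one score."""
    return sum(w * x for w, x in zip(weights, fused)) + bias


track = {"audio": [0.5, -0.5], "lyrics": "la la la"}
selected = select_modalities(track, ["audio", "lyrics"])
fused = fuse_features(extract_features(selected))
score = predict_emotion(fused, weights=[1.0, 0.5], bias=0.0)
```

In a real system each stage would be replaced by learned components; the point here is only the order of the stages and where fusion sits between extraction and prediction.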

Paper Structure

This paper contains 36 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Russell's Circumplex Model. Emotions are characterized by valence (ranging from negative to positive) along the horizontal axis and arousal (ranging from low to high) along the vertical axis.
  • Figure 2: Modalities used in MMER.
  • Figure 3: Framework summarizing past and current MMER methods.
  • Figure 4: Categorization of audio features.
  • Figure 5: Comparison of fusion methods in music emotion prediction.
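
The valence-arousal plane described in the Figure 1 caption can be made concrete with a small helper that names the quadrant of a point. This is a minimal sketch: the function name is hypothetical, and it assumes both axes are centered at zero (for example, scores in [-1, 1]), which the caption does not specify.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Describe the quadrant of a (valence, arousal) point.

    Assumption: both axes are centered at 0, with valence on the
    horizontal axis and arousal on the vertical axis.
    """
    valence_side = "positive" if valence >= 0 else "negative"
    arousal_side = "high" if arousal >= 0 else "low"
    return f"{valence_side} valence, {arousal_side} arousal"


label = circumplex_quadrant(0.7, -0.2)
```

Points exactly on an axis are assigned to the positive or high side here, which is an arbitrary convention chosen for the sketch.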