A Study on the Data Distribution Gap in Music Emotion Recognition

Joann Ching, Gerhard Widmer

TL;DR

This work investigates the data distribution gap and genre biases in Music Emotion Recognition (MER) across five diverse, publicly available datasets with dimensional valence-arousal annotations. It systematically compares audio representations and finds that Jukebox embeddings provide superior predictive power for valence and arousal, yet cross-dataset generalization remains limited due to content and annotation divergences. The authors analyze distribution shifts and show that combining Jukebox embeddings with chroma features, together with training on a more diverse set of genres (EmoMusic, PMEmo, WTC), substantially improves both in-domain and out-of-domain MER performance, especially on unseen data such as WCMED. They propose this simple fusion as a robust baseline. Overall, the paper highlights dataset and genre biases in MER and presents a practical, generalizable approach to mitigating them, enabling more reliable cross-domain emotion recognition in music.

Abstract

Music Emotion Recognition (MER) is a task deeply connected to human perception, relying heavily on subjective annotations collected from contributors. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres, such as rock and classical, within a single framework. In this paper, we address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations -- EmoMusic, DEAM, PMEmo, WTC, and WCMED -- which span various musical styles. We demonstrate the problem of out-of-distribution generalization in a systematic experiment. By closely looking at multiple data and feature sets, we provide insight into genre-emotion relationships in existing data and examine potential genre dominance and dataset biases in certain feature representations. Based on these experiments, we arrive at a simple yet effective framework that combines embeddings extracted from the Jukebox model with chroma features and demonstrate how, alongside a combination of several diverse training sets, this permits us to train models with substantially improved cross-dataset generalization capabilities.
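
The fusion idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the feature dimensions, the random stand-in data, the per-block z-scoring, and the closed-form ridge regressor are placeholders, and the helper names (`standardize`, `fuse`, `fit_ridge`, `predict`) are hypothetical. The sketch only shows the general pattern of concatenating a learned embedding with chroma features per clip and regressing onto two targets (valence and arousal).

```python
import numpy as np

def standardize(train, test):
    """Z-score both splits using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8  # guard against constant columns
    return (train - mean) / std, (test - mean) / std

def fuse(embedding, chroma):
    """Concatenate per-clip feature vectors along the feature axis."""
    return np.concatenate([embedding, chroma], axis=1)

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized bias column."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # do not shrink the bias term
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ Y)

def predict(W, X):
    """Apply the fitted weights (last row of W is the bias)."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ W

# Synthetic stand-ins (assumption): 64-dim "embedding" and 12-bin "chroma"
# vectors per clip, with valence/arousal targets in [-1, 1].
rng = np.random.default_rng(0)
n_train, n_test = 200, 50
emb_train = rng.normal(size=(n_train, 64))
emb_test = rng.normal(size=(n_test, 64))
chroma_train = rng.random(size=(n_train, 12))
chroma_test = rng.random(size=(n_test, 12))
va_train = rng.uniform(-1, 1, size=(n_train, 2))

# Scale each feature block separately so neither dominates after fusion.
emb_train_z, emb_test_z = standardize(emb_train, emb_test)
chroma_train_z, chroma_test_z = standardize(chroma_train, chroma_test)

X_train = fuse(emb_train_z, chroma_train_z)
X_test = fuse(emb_test_z, chroma_test_z)

W = fit_ridge(X_train, va_train, alpha=10.0)
va_pred = predict(W, X_test)  # one (valence, arousal) row per test clip
```

In a cross-dataset setting, the training split would pool clips from several datasets and the test split would come from a held-out one; scaling with training statistics only, as above, avoids leaking information from the held-out set.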

Paper Structure

This paper contains 12 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Jukebox embeddings of each dataset visualized using t-SNE, with the $\times$ symbols indicating the centroids of each dataset. The close relation between DEAM and EmoMusic can clearly be seen. Note the extreme positions of the two classical music sets WTC and WCMED, which nevertheless still appear to be distinct from each other in terms of feature representation.
  • Figure 2: Visualization of K-means clustering on Chroma (top) and Jukebox (bottom) features, along with the corresponding dataset and genre distribution.
  • Figure 3: Genre distributions of EmoMusic and DEAM with the officially provided genre labels. Some genres are only present in DEAM, as EmoMusic is a smaller dataset and does not cover as broad a range of labels.
  • Figure 4: t-SNE visualization of all embeddings considered in the task, along with the corresponding inter-centroid distance heatmap for each dataset pair. The mean and variance of the inter-centroid distances are also reported.
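
The inter-centroid distances mentioned in the Figure 4 caption could be computed along the following lines. This is a sketch under stated assumptions: the caption does not say which distance metric is used or whether distances are taken in the original embedding space or in the t-SNE projection, so Euclidean distance on raw vectors is assumed here. The function name `centroid_distances` and the random data are hypothetical.

```python
import numpy as np

def centroid_distances(embeddings_by_dataset):
    """Return dataset names, the pairwise centroid distance matrix,
    and the mean and variance over unique dataset pairs."""
    names = list(embeddings_by_dataset)
    # One centroid (mean vector) per dataset.
    centroids = np.stack(
        [embeddings_by_dataset[name].mean(axis=0) for name in names]
    )
    # Pairwise Euclidean distances via broadcasting.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Upper triangle without the diagonal: each pair counted once.
    pairs = dist[np.triu_indices(len(names), k=1)]
    return names, dist, pairs.mean(), pairs.var()

# Random stand-in embeddings (assumption), shifted per dataset so the
# centroids differ; real inputs would be per-clip feature vectors.
rng = np.random.default_rng(0)
data = {
    name: rng.normal(loc=shift, size=(100, 16))
    for shift, name in enumerate(["EmoMusic", "DEAM", "PMEmo", "WTC", "WCMED"])
}
names, dist, mean_dist, var_dist = centroid_distances(data)
```

The matrix `dist` is what a heatmap would display, with `mean_dist` and `var_dist` summarizing how spread out the datasets are under a given representation.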