A Study on the Data Distribution Gap in Music Emotion Recognition

Joann Ching; Gerhard Widmer

TL;DR

This work investigates the data distribution gap and genre biases in Music Emotion Recognition (MER) across five diverse, publicly available datasets with dimensional valence-arousal annotations. A systematic comparison of audio representations finds that Jukebox embeddings predict valence and arousal better than the alternatives, yet cross-dataset generalization remains limited because the datasets diverge in both musical content and annotation. The authors analyze these distribution shifts and show that combining Jukebox embeddings with chroma features, and training on a more genre-diverse set of datasets (EmoMusic, PMEmo, WTC), substantially improves both in-domain and out-of-domain MER performance, especially on unseen data such as WCMED. They propose this simple fusion as a robust baseline. Overall, the paper highlights dataset and genre biases in MER and presents a practical approach to mitigating them, enabling more reliable cross-domain emotion recognition in music.

Abstract

Music Emotion Recognition (MER) is a task deeply connected to human perception, relying heavily on subjective annotations collected from contributors. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres, such as rock and classical, within a single framework. In this paper, we address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations -- EmoMusic, DEAM, PMEmo, WTC, and WCMED -- which span various musical styles. We demonstrate the problem of out-of-distribution generalization in a systematic experiment. By closely looking at multiple data and feature sets, we provide insight into genre-emotion relationships in existing data and examine potential genre dominance and dataset biases in certain feature representations. Based on these experiments, we arrive at a simple yet effective framework that combines embeddings extracted from the Jukebox model with chroma features and demonstrate how, alongside a combination of several diverse training sets, this permits us to train models with substantially improved cross-dataset generalization capabilities.
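The feature fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each clip is already represented by a pooled embedding vector and a 12-bin mean chroma vector, that fusion is plain concatenation after per-feature standardization, and that a ridge regressor maps the fused vector to (valence, arousal). None of these choices is confirmed by the text above: the dimensions, the regressor, the helper names (`zscore`, `fuse`, `fit_ridge`, `predict`), and the random arrays standing in for real features are all hypothetical placeholders.

```python
import numpy as np

def zscore(train, other):
    """Standardize both arrays with statistics from the training set only."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + 1e-8  # avoid division by zero
    return (train - mu) / sd, (other - mu) / sd

def fuse(embedding, chroma):
    """Fuse two per-clip feature blocks by concatenating along the feature axis."""
    return np.concatenate([embedding, chroma], axis=1)

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unregularized bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = alpha * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # do not penalize the bias
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ Y)

def predict(W, X):
    """Apply the fitted weights (including bias) to new feature rows."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ W

# Placeholder data: random arrays stand in for real features and labels.
rng = np.random.default_rng(0)
n_train, n_test, emb_dim, chroma_dim = 200, 50, 64, 12  # assumed sizes
emb_train = rng.normal(size=(n_train, emb_dim))
emb_test = rng.normal(size=(n_test, emb_dim))
chroma_train = rng.random(size=(n_train, chroma_dim))
chroma_test = rng.random(size=(n_test, chroma_dim))
va_train = rng.uniform(-1.0, 1.0, size=(n_train, 2))  # (valence, arousal)

# Standardize each block separately so neither dominates by scale alone.
emb_train_z, emb_test_z = zscore(emb_train, emb_test)
chroma_train_z, chroma_test_z = zscore(chroma_train, chroma_test)

fused_train = fuse(emb_train_z, chroma_train_z)
fused_test = fuse(emb_test_z, chroma_test_z)

W = fit_ridge(fused_train, va_train, alpha=1.0)
va_pred = predict(W, fused_test)  # one (valence, arousal) pair per test clip
```

In a cross-dataset setting, the training rows would be pooled from several source datasets and the test rows drawn from a held-out one, so that the standardization statistics and regressor never see the target distribution.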
