MusicSem: A Semantically Rich Language-Audio Dataset of Natural Music Descriptions

Rebecca Salganik, Teng Tu, Fei-Yueh Chen, Xiaohao Liu, Keifeng Lu, Ethan Luvisia, Zhiyao Duan, Guillaume Salha-Galvan, Anson Kahng, Yunshan Ma, Jian Kang

TL;DR

This paper introduces MusicSem, a dataset of 32,493 language-audio pairs derived from organic music-related discussions on Reddit. Compared to existing datasets, it captures a broader spectrum of musical semantics, reflecting how listeners naturally describe music in nuanced, human-centered ways.

Abstract

Music representation learning is central to music information retrieval and generation. While recent advances in multimodal learning have improved alignment between text and audio for tasks such as cross-modal music retrieval, text-to-music generation, and music-to-text generation, existing models often struggle to capture users' expressed intent in natural language descriptions of music. This observation suggests that the datasets used to train and evaluate these models do not fully reflect the broader and more natural forms of human discourse through which music is described. In this paper, we introduce MusicSem, a dataset of 32,493 language-audio pairs derived from organic music-related discussions on the social media platform Reddit. Compared to existing datasets, MusicSem captures a broader spectrum of musical semantics, reflecting how listeners naturally describe music in nuanced and human-centered ways. To structure these expressions, we propose a taxonomy of five semantic categories: descriptive, atmospheric, situational, metadata-related, and contextual. In addition to the construction, analysis, and release of MusicSem, we use the dataset to evaluate a wide range of multimodal models for retrieval and generation, highlighting the importance of modeling fine-grained semantics. Overall, MusicSem serves as a novel semantics-aware resource to support future research on human-aligned multimodal music representation learning.
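
As a rough illustration of the five-category taxonomy named in the abstract, one language-audio pair could be represented as below. This is a minimal sketch, not the released MusicSem schema: the class names, the field names (`audio_id`, `text`, `semantics`), and the example record are all assumptions made for illustration, and the assignment of the example phrases to categories is only a guess at how the taxonomy might be applied.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SemanticCategory(Enum):
    """The five semantic categories listed in the abstract."""
    DESCRIPTIVE = "descriptive"
    ATMOSPHERIC = "atmospheric"
    SITUATIONAL = "situational"
    METADATA = "metadata-related"
    CONTEXTUAL = "contextual"


@dataclass
class LanguageAudioPair:
    """Hypothetical record layout; the real dataset fields may differ."""
    audio_id: str
    text: str
    semantics: Dict[SemanticCategory, List[str]] = field(default_factory=dict)

    def categories_present(self) -> List[SemanticCategory]:
        # Return categories with at least one annotation, in taxonomy order.
        return [c for c in SemanticCategory if self.semantics.get(c)]


# Invented example record, not taken from the dataset.
example = LanguageAudioPair(
    audio_id="track-0001",
    text="A dreamy track I play on late-night drives.",
    semantics={
        SemanticCategory.ATMOSPHERIC: ["dreamy"],
        SemanticCategory.SITUATIONAL: ["late-night drives"],
    },
)

print([c.value for c in example.categories_present()])
# prints ['atmospheric', 'situational']
```

Keeping the categories in an enum makes it straightforward to count how often each one appears across a collection of pairs, which is the kind of fine-grained breakdown the evaluation in the abstract alludes to.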

Paper Structure

This paper contains 88 sections, 4 equations, 7 figures, 14 tables, and 1 algorithm.

Figures (7)

  • Figure 1: The MusicSem website provides access to the full dataset, detailed documentation, visualizations, and source code for data construction and experiments at: https://music-sem-web.vercel.app/.
  • Figure 2: Example of semantic content extracted from a Reddit post in MusicSem. The figure highlights how a single description can express a variety of different musical semantics, corresponding to the five categories defined in our taxonomy.
  • Figure 3: Overview of the extraction and verification pipeline used to construct MusicSem. After selecting the source Reddit threads, the dataset construction proceeds in two main stages: an extraction step that identifies candidate semantic content from the textual elements of each thread, and a summarization and verification step that reformulates the extracted content into sentence-like semantic annotations, verifies song-artist associations, and checks the plausibility of the extracted semantic information.
  • Figure 4: An example of personalization and contextualization on Reddit.
  • Figure 5: Visualizations for MusicSem. (a) Music genre distribution visualized as a word cloud, where larger font size indicates higher frequency. (b) Popularity distribution of songs. (c) Distribution of the number of words per language-audio pair.
  • ...and 2 more figures