The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam

TL;DR

SDD addresses the lack of publicly available, openly licensed audio-caption pairs for evaluating music-and-language models by providing 1.1k captions for 706 MTG-Jamendo tracks. It supports standardized cross-dataset benchmarking against MusicCaps on three tasks (music captioning, text-to-music generation, and music-language retrieval), using diverse crowdsourced captions and manually validated subsets. The results reveal significant cross-domain gaps and caution against over-reliance on any single dataset, given in-domain biases and varying task definitions. The work argues for broader, more diverse data and cross-dataset evaluation to obtain a more reliable appraisal of model capabilities in real-world settings.

Abstract

We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.

Paper Structure (15 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Distribution of the top nine countries of residence reported by contributors in the onboarding survey.
  • Figure 2: Distribution of music aspects covered by the most frequent word stems in the collected captions.