JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata

Abhinaba Roy, Renhang Liu, Tongyu Lu, Dorien Herremans

TL;DR

JamendoMaxCaps addresses the shortage of large-scale music-language data by building a public dataset of over 360k instrumental tracks paired with model-generated captions and imputed metadata. The pipeline has three stages: caption generation with Qwen2-Audio, retrieval of similar tracks using MERT audio features and FLAN-T5 metadata embeddings, and in-context learning with a locally hosted Llama-2 that fills in missing metadata from the retrieved examples. Evaluations (objective BERTScore and BLEU metrics, subjective retrieval studies, and listening tests) indicate that retrieval-contextualized imputation improves metadata quality and description relevance. The dataset supports work on music retrieval, multimodal representation learning, and generative music models, and because the language model runs locally, the pipeline also suits privacy-aware, cost-sensitive research settings.

Abstract

We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 362,000 freely licensed instrumental tracks from the Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs; these songs then serve as context for a local large language model (LLLM) that fills in missing metadata. This approach yields a more comprehensive and informative dataset for music-language understanding research. We validate the approach quantitatively with five different measurements. By making JamendoMaxCaps publicly available, we provide a high-quality resource to advance tasks such as music retrieval, multimodal representation learning, and generative music models.
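
As a rough illustration of the retrieve-then-impute idea described above, the sketch below ranks catalog tracks by a weighted mix of audio and metadata similarity and then assembles a few-shot prompt from the nearest neighbours. It is not the authors' implementation: the two-dimensional vectors are toy stand-ins for real audio and metadata embeddings, and the function names, the `audio_weight` parameter, the equal weighting, and the prompt wording are all assumptions made for this example. The resulting prompt would be passed to a locally hosted language model, a step omitted here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_similar(query, catalog, k=2, audio_weight=0.5):
    """Return the k catalog tracks most similar to the query.

    The score is a weighted sum of audio and metadata cosine similarity;
    the weighting scheme is an assumption for illustration only.
    """
    scored = []
    for track in catalog:
        score = (audio_weight * cosine(query["audio_emb"], track["audio_emb"])
                 + (1 - audio_weight) * cosine(query["meta_emb"], track["meta_emb"]))
        scored.append((score, track))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [track for _, track in scored[:k]]


def build_prompt(query, neighbours, field):
    """Assemble a few-shot prompt (hypothetical format) from retrieved tracks."""
    lines = [f"Fill in the missing '{field}' field for the last track."]
    for neighbour in neighbours:
        lines.append(f"Track: {neighbour['metadata']}")
    known = {key: value for key, value in query["metadata"].items() if value is not None}
    lines.append(f"Track: {known} -> {field}: ?")
    return "\n".join(lines)


# Toy catalog: embeddings and metadata values are invented for the example.
catalog = [
    {"audio_emb": [1.0, 0.0], "meta_emb": [1.0, 0.0],
     "metadata": {"genre": "ambient", "speed": "slow"}},
    {"audio_emb": [0.0, 1.0], "meta_emb": [0.0, 1.0],
     "metadata": {"genre": "rock", "speed": "fast"}},
    {"audio_emb": [0.9, 0.1], "meta_emb": [0.8, 0.2],
     "metadata": {"genre": "ambient", "speed": "medium"}},
]
query = {"audio_emb": [1.0, 0.1], "meta_emb": [0.9, 0.1],
         "metadata": {"genre": None, "speed": "slow"}}

neighbours = retrieve_similar(query, catalog, k=2)
prompt = build_prompt(query, neighbours, "genre")
print(prompt)
```

Keeping retrieval separate from prompt construction means either half can be swapped out (a different similarity measure, a different prompt template) without touching the other.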

Paper Structure

This paper contains 15 sections, 3 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our proposed pipeline for the creation of the dataset.
  • Figure 2: The metadata imputation process.
  • Figure 3: Distribution of genres in original and imputed metadata.
  • Figure 4: Distribution of speed in original and imputed metadata.
  • Figure 5: Distribution of variable tags in original and imputed metadata.