IncompeBench: A Permissively Licensed, Fine-Grained Benchmark for Music Information Retrieval

Benjamin Clavié; Atoof Shakir; Jonah Turner; Sean Lee; Aamir Shakir; Makoto P. Kato

TL;DR

IncompeBench addresses the lack of open, fine-grained benchmarks for music information retrieval by building a large, permissively licensed evaluation corpus from IncompeTech. It uses a multi-stage pipeline comprising song-card extraction with Gemini 3 Pro, DSPy-driven query generation, diverse candidate retrieval, and UMBRELA-inspired automated labeling. The automated labels are validated against expert human annotators, reaching a Cohen's $\kappa$ of $0.94$. The work provides two evaluation modes (Strict and Lenient) and publicly releases the data, prompts, and generation code, enabling reproducible, nuanced assessment of text-to-music retrieval systems. Baseline experiments across multiple models reveal meaningful gaps in fine-grained ranking capability, underscoring the benchmark's value for advancing MIR research and development.
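
As an illustration of the agreement statistic mentioned above, the following is a minimal sketch of unweighted Cohen's $\kappa$ between two annotators. The function name `cohens_kappa`, the toy label lists, and the 0 to 3 graded scale are assumptions made for this example; they are not the paper's data, and the paper may compute agreement with a different variant (for instance, weighted or binarized).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; treat as perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical graded relevance labels (0-3) for eight query-song pairs.
human = [0, 1, 2, 2, 0, 1, 3, 3]
llm = [0, 1, 2, 1, 0, 1, 3, 3]
kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 7 of 8 items (observed agreement 0.875) against a chance agreement of 0.25, so $\kappa = (0.875 - 0.25) / (1 - 0.25) \approx 0.833$.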

Abstract

Multimodal Information Retrieval has made significant progress in recent years, leveraging the increasingly strong multimodal abilities of deep pre-trained models to represent information across modalities. Music Information Retrieval (MIR), in particular, has considerably increased in quality, with neural representations of music even making their way into everyday products. However, there is a lack of high-quality benchmarks for evaluating music retrieval performance. To address this issue, we introduce IncompeBench, a carefully annotated benchmark comprising $1,574$ permissively licensed, high-quality music snippets, $500$ diverse queries, and over $125,000$ individual relevance judgements. These annotations were created using a multi-stage pipeline, resulting in high agreement between human annotators and the generated data. The resulting datasets are publicly available at https://huggingface.co/datasets/mixedbread-ai/incompebench-strict and https://huggingface.co/datasets/mixedbread-ai/incompebench-lenient with the prompts available at https://github.com/mixedbread-ai/incompebench-programs.

Paper Structure (18 sections, 2 equations, 1 figure, 4 tables)

This paper contains 18 sections, 2 equations, 1 figure, 4 tables.

Introduction
Related Works
Music-Language Datasets and Benchmarks.
Text-to-Music Retrieval Models.
Benchmark Building
The Song Corpus: Choosing IncompeTech
Corpus Preparation
Query Generation
Creating Song Cards
Generation Step
Annotation Candidate Selection
Automated Labelling
IncompeBench
Benchmark Statistics
LLM-Human Agreement
...and 3 more sections

Figures (1)

Figure 1: Annotation distributions at the corpus and query level.
