ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering

Daeyong Kwon, SeungHeon Doh, Juhan Nam

TL;DR

The paper targets a knowledge-grounding gap in music question answering by introducing MusWikiDB, a music-specific vector database, and ArtistMus, an artist-centric benchmark. It evaluates retrieval-augmented generation (RAG) across open-source and proprietary models, showing substantial improvements in factual accuracy and robust transfer to out-of-domain data. Ablation studies demonstrate that RAG-style fine-tuning and two-stage retrieval further enhance both factual recall and contextual reasoning. The resources enable scalable, domain-specific QA in music and suggest a pathway toward democratizing access to high-quality music knowledge grounded in verifiable passages.

Abstract

Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.
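The headline gain is stated in percentage points, which is an absolute difference between two accuracies and not a relative improvement. A minimal sketch of the arithmetic, using only the Qwen3 8B figures quoted above (the function names are illustrative and not from the paper):

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Improvement relative to the starting accuracy, in percent."""
    return round((after - before) / before * 100, 1)


# Accuracies for Qwen3 8B as quoted in the abstract.
zero_shot, with_rag = 35.0, 91.8

pp = percentage_point_gain(zero_shot, with_rag)   # absolute gain
rel = relative_gain(zero_shot, with_rag)          # relative gain

print(f"+{pp} percentage points, {rel}% relative improvement")
```

The two numbers answer different questions, so the +56.8 figure should not be read as "56.8% better".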

Paper Structure

This paper contains 25 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Performance comparison of open-source LLMs on the ArtistMus benchmark under three settings (Zero-shot, RAG, and Rerank). The dashed lines denote closed-source models (GPT-4o and Gemini 2.5 Flash) evaluated under the same conditions. The Rerank strategy consistently yields the highest factual and contextual accuracy across all open-source models, narrowing the gap with proprietary systems.
  • Figure 2: Regional distribution of the 500 music artists in ArtistMus, spanning 163 countries (or regions) to ensure global diversity beyond the traditional U.S.- and Europe-centric focus.
  • Figure 3: Debut years of the 500 music artists in ArtistMus.
  • Figure 4: RAG performance and retrieval time for the Wikipedia corpus (Karpukhin et al., 2020) and MusWikiDB.
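
The three settings compared in Figure 1 differ in what context the model receives before answering. The sketch below is a toy illustration under stated assumptions, not the paper's pipeline: token overlap stands in for the first-stage retriever, term-frequency cosine similarity stands in for the reranker, and the passages, query, and function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def first_stage(query: str, passages: list[str], k: int) -> list[str]:
    """Cheap, recall-oriented stage: rank by number of shared distinct tokens."""
    q = set(tokenize(query))
    ranked = sorted(passages, key=lambda p: len(q & set(tokenize(p))), reverse=True)
    return ranked[:k]


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(query: str, candidates: list[str], k: int) -> list[str]:
    """Second stage: rescore the shortlist with a different similarity (assumed here)."""
    q = Counter(tokenize(query))
    ranked = sorted(candidates, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Zero-shot passes an empty context; RAG and Rerank pass retrieved passages."""
    if not context:
        return f"Question: {question}\nAnswer:"
    joined = "\n".join(f"- {p}" for p in context)
    return f"Context:\n{joined}\nQuestion: {question}\nAnswer:"


# Hypothetical toy corpus; real passages would come from a vector database.
passages = [
    "the band formed in seoul in 1998 and debuted in 1999",
    "the band released a live album recorded in seoul",
    "jazz is a genre that originated in new orleans",
    "the singer debuted in 1985 with a folk album",
]
question = "when did the band debut in seoul"

zero_shot_prompt = build_prompt(question, [])
rag_prompt = build_prompt(question, first_stage(question, passages, k=2))
candidates = first_stage(question, passages, k=3)   # retrieve a wider shortlist
reranked = rerank(question, candidates, k=2)        # keep only the best after rescoring
rerank_prompt = build_prompt(question, reranked)
print(rerank_prompt)
```

The design point is that the first stage can afford to be broad because the second stage rescoring decides which few passages actually reach the prompt.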