MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Daeyong Kwon, SeungHeon Doh, Juhan Nam

TL;DR

MusT-RAG tackles the challenge of limited music-specific knowledge in general LLMs by introducing a retrieval-augmented framework and a music-focused vector database, MusWikiDB, to support text-only music question answering (MQA). It combines retrieval with context-aware generation and a RAG-style fine-tuning regime, yielding substantial factual gains over strong baselines and demonstrating robustness to domain shifts, as shown on ArtistMus and TrustMus. The approach delivers faster, more accurate music knowledge grounding without full model retraining and provides two new resources, MusWikiDB and ArtistMus, for future research. Overall, MusT-RAG advances practical, domain-aware LLM deployment in music information tasks and highlights the benefits of retrieval-grounded, context-infused learning for specialized domains.

Abstract

Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning to effectively transform general-purpose LLMs into music-specific models. Our experiments demonstrate that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.

Paper Structure

This paper contains 21 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of our MusT-RAG framework. The retriever searches for relevant information in MusWikiDB based on similarity for music-related queries, and augments the generator's prompt with this information to generate an answer.
  • Figure 2: RAG performance and retrieval time for the Wikipedia corpus (Karpukhin et al., 2020) and MusWikiDB.
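
The retrieve-then-augment flow described in the abstract and in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation. The passage list, the bag-of-words `embed` function, and the prompt template are all assumptions made for illustration. A real system would query MusWikiDB with a learned text encoder and pass the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a music passage store; NOT the actual MusWikiDB contents.
PASSAGES = [
    "The Beatles were an English rock band formed in Liverpool in 1960.",
    "A sonata is a musical composition, usually for a solo instrument.",
    "Miles Davis was an American jazz trumpeter and bandleader.",
]

def embed(text):
    """Toy bag-of-words vector (assumption); a real retriever would use a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Augment the generator's prompt with the retrieved context (hypothetical template)."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"

question = "Where were the Beatles formed?"
top = retrieve(question, PASSAGES, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting `prompt` string is what would be handed to the generator. Swapping the toy `embed` for a dense encoder and the list for a vector index changes the retrieval quality but not the overall shape of the pipeline.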