MusiCRS: Benchmarking Audio-Centric Conversational Recommendation
Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, Junda Wu
TL;DR
MusiCRS introduces the first audio-centric conversational music recommendation benchmark by linking authentic Reddit conversations to ground-truth audio tracks from YouTube. It provides 477 conversations across seven genres and 3,589 musical entities, with 100 evaluation candidates per thread. Evaluation covers three input configurations (audio-only, query-only, and audio+query) to study cross-modal integration. Experiments show that current multimodal systems do not consistently outperform single-modality approaches: retrieval-based embeddings often perform best, genre-specific patterns emerge (e.g., for Jazz and Classical), and grounding abstract musical concepts in audio remains challenging. By releasing the MusiCRS dataset, evaluation code, and baselines, the work offers a practical platform for advancing audio-grounded multimodal conversational recommendation and points to clear directions for improving cross-modal knowledge integration.
Abstract
Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain in which effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding tracks. MusiCRS includes 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz), with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS supports evaluation under three input modality configurations: audio-only, query-only, and audio+query, allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems struggle with cross-modal integration, with optimal performance frequently occurring in single-modality settings rather than multimodal configurations. This highlights fundamental limitations in cross-modal knowledge integration, as models excel at dialogue semantics but struggle when grounding abstract musical concepts in audio. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.
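To make the three-configuration protocol concrete, the sketch below ranks a pool of 100 candidates for one conversation under audio-only, query-only, and audio+query inputs. It is an illustration under stated assumptions, not the released evaluation code: the embeddings are random placeholders, the audio+query fusion by simple averaging is a hypothetical choice, and Recall@K and NDCG@K are assumed metrics that the text above does not name. All function names are invented for this example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_query(mode, audio_vec, text_vec):
    """Build the query vector for one input configuration.

    Averaging the two vectors for 'audio+query' is an assumption made for
    illustration; real systems may fuse modalities very differently.
    """
    if mode == "audio-only":
        return list(audio_vec)
    if mode == "query-only":
        return list(text_vec)
    if mode == "audio+query":
        return [(a + t) / 2 for a, t in zip(audio_vec, text_vec)]
    raise ValueError(f"unknown mode: {mode}")


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted by descending cosine similarity to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranking, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for i in ranking[:k] if i in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranking, relevant, k):
    """Binary-relevance NDCG@k with a log2 position discount."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, i in enumerate(ranking[:k]) if i in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def demo(num_candidates=100, dim=16, k=10, seed=0):
    """Score one synthetic conversation under each input configuration."""
    rng = random.Random(seed)
    rand_vec = lambda: [rng.gauss(0, 1) for _ in range(dim)]
    candidates = [rand_vec() for _ in range(num_candidates)]  # placeholder embeddings
    audio_vec, text_vec = rand_vec(), rand_vec()               # placeholder inputs
    relevant = {3, 17, 42}                                     # hypothetical ground truth
    results = {}
    for mode in ("audio-only", "query-only", "audio+query"):
        ranking = rank_candidates(build_query(mode, audio_vec, text_vec), candidates)
        results[mode] = (recall_at_k(ranking, relevant, k), ndcg_at_k(ranking, relevant, k))
    return results


if __name__ == "__main__":
    for mode, (recall, ndcg) in demo().items():
        print(f"{mode:12s} Recall@10={recall:.3f} NDCG@10={ndcg:.3f}")
```

Because the same candidate pool and ground truth are reused across the three modes, any difference in scores comes from the query representation alone, which is the kind of comparison the benchmark is designed to support.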
