MusiCRS: Benchmarking Audio-Centric Conversational Recommendation

Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, Junda Wu

TL;DR

MusiCRS introduces the first audio-centric conversational music recommendation benchmark, linking authentic Reddit conversations to ground-truth audio tracks from YouTube. It provides 477 conversations across seven genres and 3,589 musical entities, with 100 evaluation candidates per thread, and evaluates models under audio-only, query-only, and audio+query inputs to study cross-modal integration. Experiments show that current multimodal systems do not consistently outperform single-modality approaches: retrieval-based embeddings often perform best, genre-specific patterns emerge (e.g., for Jazz and Classical), and grounding abstract musical concepts in audio remains challenging. By releasing the MusiCRS dataset, evaluation code, and baselines, the work offers a practical platform for advancing audio-grounded multimodal conversational recommendation and highlights clear directions for improving cross-modal knowledge integration.

Abstract

Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain in which effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding tracks. MusiCRS includes 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz), with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS supports evaluation under three input modality configurations: audio-only, query-only, and audio+query, allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems struggle with cross-modal integration, with optimal performance frequently occurring in single-modality settings rather than multimodal configurations. This highlights fundamental limitations in cross-modal knowledge integration, as models excel at dialogue semantics but struggle when grounding abstract musical concepts in audio. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.

Paper Structure

This paper contains 9 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Limitations of existing conversational recommendation approaches (top) and comparison of music recommendation datasets (bottom). MusiCRS is the only benchmark combining authentic conversations, audio grounding, ground truth annotations, recommendation evaluation, and multimodal capabilities.
  • Figure 2: Representative examples from MusiCRS showing Reddit conversations, derived queries, audio, and candidates across genres.
  • Figure 3: Dataset genre and entity coverage (left) and dialogue structure statistics (right). MusiCRS encompasses diverse genres, artists, and song distributions, and a wide range of conversation styles. Macro-genre abbreviations: MM (Modern & Mainstream: hip-hop, pop); IM (Instrumental / Art Music: classical, jazz); AE (Alternative / Experimental: indie, metal, electronic).
  • Figure 4: Mean Reciprocal Rank (MRR) comparison across genres and model types.
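Figure 4 reports Mean Reciprocal Rank (MRR). As a reminder of what the metric measures, here is a minimal sketch of the standard definition: for each conversation, take the reciprocal of the rank of the first relevant item in the model's ranked candidate list, then average over conversations. The function names and the toy track identifiers are illustrative assumptions, not taken from the MusiCRS evaluation code, which may handle ties, cutoffs, or multiple ground-truth items differently.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average the reciprocal rank over (ranked list, relevant set) pairs."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical track identifiers (not real MusiCRS data).
toy_runs = [
    (["track_a", "track_b", "track_c"], {"track_a"}),  # first hit at rank 1
    (["track_a", "track_b", "track_c"], {"track_b"}),  # first hit at rank 2
]
toy_mrr = mean_reciprocal_rank(toy_runs)  # (1/1 + 1/2) / 2
```

With 100 candidates per thread, a model that always places a relevant track first scores 1.0. Ranking the first relevant track last in every thread gives 0.01, and a thread whose ranked list contains no relevant track contributes 0.0.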