Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation

Elena V. Epure, Yashar Deldjoo, Bruno Sguerra, Markus Schedl, Manuel Moussallam

TL;DR

This paper argues that the rise of large language models disrupts traditional information-retrieval-based evaluation in music recommender systems, since LLMs generate text rather than rank items and can hallucinate or rely on stale knowledge. It provides a structured review of how LLMs affect user modeling, item modeling, and natural-language recommendations in music, and draws on NLP evaluation practices to propose a comprehensive framework of success and risk dimensions. The authors introduce practical evaluation bundles for LLM prompting strategies, including few-shot in-context learning (ICL), retrieval-augmented generation, and chain-of-thought prompting, and outline metrics for grounding, discovery, personalization, and cultural coverage, along with risk diagnostics such as hallucinations and bias. The paper advocates an updated, cross-disciplinary evaluation paradigm that aligns with the human experience of music recommendations, emphasizing transparency, grounding in ground-truth data, and continuous monitoring of biases and long-tail item coverage. This framework aims to guide researchers toward more robust, fair, and contextually meaningful evaluations of LLM-driven music recommendation systems.

Abstract

Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.

Paper Structure

This paper contains 38 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Paper's overview as a generic diagram presenting music recommendation with LLMs.
  • Figure 2: Representation of the method to derive NL user preference profiles from consumption data.
  • Figure 3: Prompt example with the different task components highlighted.
  • Figure 4: NLP-driven evaluation framework: decision process and metrics overview.