TALKPLAY: Multimodal Music Recommendation with Large Language Models

Seungheon Doh, Keunwoo Choi, Juhan Nam

TL;DR

TalkPlay is a unified, end-to-end multimodal music recommender in which an LLM generates music tokens as part of the conversation. A music tokenizer quantizes each of five modalities into 1024 clusters, so a sequence of one token per modality can address 1024^5 (about 1.13 quadrillion) distinct combinations. The LLM learns to navigate this space through vocabulary expansion and supervised fine-tuning on synthetic conversations. Empirical results show that TalkPlay outperforms unimodal baselines in both retrieval quality (MRR and Hit@K) and response naturalness, holds up in extended multi-turn dialogues, and is preferred in human judgments. The approach reduces system complexity by removing the need for separate dialogue-management and ranking modules, and it demonstrates the potential of generative retrieval for multimodal, conversational music recommendation.
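
The size of that space follows from simple counting. The check below is a minimal sketch that takes the two figures stated in the summary (five modalities, 1024 clusters each) and assumes one new vocabulary token per (modality, cluster) pair; the variable names are illustrative and not taken from the paper.

```python
# Counting sketch using the figures stated in the summary.
NUM_MODALITIES = 5
CLUSTERS_PER_MODALITY = 1024

# Assumption: one new vocabulary token per (modality, cluster) pair.
new_vocab_tokens = NUM_MODALITIES * CLUSTERS_PER_MODALITY

# Distinct combinations when one cluster is chosen per modality.
addressable_combinations = CLUSTERS_PER_MODALITY ** NUM_MODALITIES

print(new_vocab_tokens)          # 5120
print(addressable_combinations)  # 1125899906842624
```

Under these assumptions the vocabulary grows by only a few thousand tokens, while the combinations they can express number roughly 1.13 quadrillion.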

Abstract

We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
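
To make the idea of music tokens sharing a vocabulary with text more concrete, here is a minimal sketch of how per-modality codebook indices could be written as token strings and read back from generated text. The modality names, the `<|modality_index|>` token format, and all function names are hypothetical; this section does not specify the paper's actual token strings or tokenizer API.

```python
import re

# Hypothetical modality names and token format (assumptions for illustration).
MODALITIES = ["audio", "lyrics", "metadata", "tags", "playlist"]
NUM_CLUSTERS = 1024
TOKEN_PATTERN = re.compile(r"<\|([a-z]+)_(\d+)\|>")


def build_music_vocab():
    """List every music token string that would be added to a text vocabulary."""
    return [f"<|{m}_{c}|>" for m in MODALITIES for c in range(NUM_CLUSTERS)]


def encode_item(codebook_indices):
    """Write one item's per-modality cluster indices as a music token string."""
    return "".join(f"<|{m}_{codebook_indices[m]}|>" for m in MODALITIES)


def decode_item(text):
    """Recover per-modality cluster indices from text containing music tokens."""
    return {m: int(c) for m, c in TOKEN_PATTERN.findall(text) if m in MODALITIES}


item = {"audio": 17, "lyrics": 803, "metadata": 5, "tags": 1023, "playlist": 0}
reply = "Here is a mellow track you might enjoy: " + encode_item(item)
print(len(build_music_vocab()))    # 5120
print(decode_item(reply) == item)  # True
```

In a real system the new token strings would be registered with the LLM's tokenizer and its embedding table resized accordingly; the decoded indices would then serve as the query for retrieving an item from the music database.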

Paper Structure

This paper contains 21 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Overview of TalkPlay: (1) The multimodal music tokenizer converts source data into modality embeddings $v$ and quantizes them into codebook indices $c$, which are mapped to music tokens $i$. (2) The LLM is fine-tuned on text and music token sequences. (3) The generated music tokens are mapped back to codebook indices and are used as queries to retrieve music items from the database.
  • Figure 2: Performance comparison across conversation turns. The x-axis shows the turn number in the dialogue, and the y-axis shows Mean Reciprocal Rank (MRR).
  • Figure 3: A-vs-B human evaluation results, comparing TalkPlay against existing conversational music recommendation models on recommendation relevance (left) and response naturalness (right).
  • Figure 4: Survey interface used for the subjective evaluation.
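
For readers unfamiliar with the retrieval metrics named above (MRR in Figure 2, Hit@K in the TL;DR), the following sketch computes both using their standard definitions. The function names and toy data are illustrative assumptions; this is not the paper's evaluation code.

```python
def reciprocal_rank(ranked_ids, target_id):
    """Return 1 / rank of the target (1-indexed), or 0.0 if it is absent."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id == target_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, targets):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, t) for r, t in zip(rankings, targets)) / len(targets)


def hit_at_k(rankings, targets, k):
    """Fraction of queries whose target appears in the top-k results."""
    return sum(t in r[:k] for r, t in zip(rankings, targets)) / len(targets)


# Toy data: three queries, each with a ranked list and one ground-truth item.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "z"]

print(mean_reciprocal_rank(rankings, targets))  # 0.5
print(hit_at_k(rankings, targets, k=2))
```

A per-turn curve like the one in Figure 2 could be obtained by grouping queries by turn number and applying `mean_reciprocal_rank` to each group.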