TALKPLAY: Multimodal Music Recommendation with Large Language Models
Seungheon Doh, Keunwoo Choi, Juhan Nam
TL;DR
TalkPlay introduces a unified, end-to-end multimodal music recommender in which an LLM generates music tokens as part of the conversation. A novel music tokenizer encodes each of five modalities into one of 1024 cluster tokens, giving a combinatorial space of roughly 1.13 quadrillion (1024^5) possible token combinations, which the LLM learns to navigate via vocabulary expansion and supervised fine-tuning on synthetic conversations. Empirical results show that TalkPlay outperforms unimodal baselines in both retrieval quality (MRR and Hit@K) and response naturalness, with strong performance in extended multi-turn dialogues and favorable human judgments. The approach reduces system complexity by removing the need for separate dialogue-management and ranking modules, and it demonstrates the potential of generative retrieval for multimodal, conversational music recommendation.
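The size figures in the TL;DR follow from simple arithmetic. The short Python check below assumes only what the summary states (five modalities, 1024 clusters per modality); the variable names are illustrative and not taken from the paper.

```python
# Assumptions taken from the summary: 5 modalities, 1024 clusters each.
NUM_MODALITIES = 5
CLUSTERS_PER_MODALITY = 1024

# Tokens added to the vocabulary: one per (modality, cluster) pair.
new_vocab_tokens = NUM_MODALITIES * CLUSTERS_PER_MODALITY

# Distinct token tuples: one cluster choice per modality.
distinct_combinations = CLUSTERS_PER_MODALITY ** NUM_MODALITIES

print(new_vocab_tokens)       # 5120
print(distinct_combinations)  # 1125899906842624, about 1.13 quadrillion
```

Under these assumptions the vocabulary grows by only a few thousand entries, while the number of distinct five-token combinations is 2^50.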
Abstract
We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for the text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluations demonstrate that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
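To make the two mechanisms in the abstract more concrete, here is a minimal, dependency-free Python sketch of how a track might be rendered as one token per modality and how those tokens might be appended to an existing text vocabulary. Everything in it is an assumption for illustration: the token format `<|modality_cluster|>`, the modality names, the toy text vocabulary, and the helper functions `music_token`, `expand_vocabulary`, and `encode_track` are hypothetical and do not come from the paper. The cluster count of 1024 is the figure given in the TL;DR.

```python
# Hypothetical modality names, following the five signals listed in the abstract.
MODALITIES = ["audio", "lyrics", "metadata", "tags", "playlist"]
NUM_CLUSTERS = 1024  # clusters per modality, as stated in the TL;DR


def music_token(modality: str, cluster: int) -> str:
    """Render one (modality, cluster) pair as a token string (assumed format)."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    if not 0 <= cluster < NUM_CLUSTERS:
        raise ValueError(f"cluster out of range: {cluster}")
    return f"<|{modality}_{cluster}|>"


def expand_vocabulary(vocab: dict[str, int]) -> dict[str, int]:
    """Return a copy of vocab with every music token appended after the text tokens."""
    expanded = dict(vocab)
    for modality in MODALITIES:
        for cluster in range(NUM_CLUSTERS):
            # New tokens receive the next free id; existing ids are untouched.
            expanded.setdefault(music_token(modality, cluster), len(expanded))
    return expanded


def encode_track(cluster_ids: dict[str, int]) -> list[str]:
    """Represent a track as one token per modality, in a fixed modality order."""
    return [music_token(m, cluster_ids[m]) for m in MODALITIES]


# Toy text vocabulary standing in for a pretrained LLM's vocabulary.
text_vocab = {"<pad>": 0, "hello": 1, "jazz": 2}
vocab = expand_vocabulary(text_vocab)

# Made-up cluster assignments for a single track.
track_tokens = encode_track(
    {"audio": 17, "lyrics": 512, "metadata": 3, "tags": 1023, "playlist": 0}
)
track_ids = [vocab[t] for t in track_tokens]
print(track_tokens)
print(track_ids)
```

In a real system the new vocabulary entries would also need embedding rows in the LLM, which would then be trained during fine-tuning; this sketch covers only the id bookkeeping.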
