Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval
Enrico Palumbo, Gustavo Penha, Andreas Damianou, José Luis Redondo García, Timothy Christopher Heath, Alice Wang, Hugues Bouchard, Mounia Lalmas
TL;DR
The paper reframes prompt-based music recommendation as a generative retrieval task and introduces Text2Tracks, a decoder-based model that maps natural-language prompts directly to semantically rich track IDs. It systematically explores ID representations (content-based, integer-based, and learned semantic IDs) and finds that semantic IDs derived from collaborative-filtering embeddings offer the strongest performance while drastically reducing the number of decoding steps. Text2Tracks with semantic IDs significantly outperforms both traditional retrieval baselines and other generative approaches (up to ~127% higher Hits@10). The work highlights practical benefits for conversational agents, which can generate track identifiers end to end, with potential extensions to joint text generation and broader media domains.
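To make the reported metric concrete, the following is a minimal sketch of how a Hits@10 score and a relative improvement such as "~127% higher" could be computed. It assumes that a prompt counts as a hit when at least one relevant track appears among the top 10 results, and that scores are averaged over prompts; the exact definition used in the paper may differ. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def hits_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Return 1.0 if any relevant track is in the top-k results, else 0.0.

    Assumption: a binary per-prompt hit; the paper may define Hits@K differently.
    """
    return 1.0 if any(track in relevant for track in ranked[:k]) else 0.0


def mean_hits_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average the per-prompt hit indicator over all evaluated prompts."""
    return sum(hits_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


def relative_gain_pct(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score


# Toy data (made up for illustration): one prompt with a hit, one without.
toy_runs = [
    (["t1", "t2"], {"t2"}),
    (["t3", "t4"], {"t9"}),
]
toy_score = mean_hits_at_k(toy_runs, k=10)
```

Under this reading, a model that scores 0.5 against a baseline of 0.25 would be reported as 100% higher.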
Abstract
In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach has several limitations. First, it is based on a general-purpose tokenization that is optimized for words rather than for track titles. Second, it necessitates an additional entity resolution layer that matches the track title to the actual track identifier. Third, the number of decoding steps scales linearly with the length of the track title, slowing down inference. In this paper, we propose to address the task of prompt-based music recommendation as a generative retrieval task. Within this setting, we introduce novel, effective, and efficient representations of track identifiers that significantly outperform commonly used strategies. We introduce Text2Tracks, a generative retrieval model that learns a mapping from a user's music recommendation prompt to the relevant track IDs directly. Through an offline evaluation on a dataset of playlists with language inputs, we find that (1) the strategy to create IDs for music tracks is the most important factor for the effectiveness of Text2Tracks, and semantic IDs significantly outperform commonly used strategies that rely on song titles as identifiers, and (2) provided with the right choice of track identifiers, Text2Tracks outperforms sparse and dense retrieval solutions trained to retrieve tracks from language prompts.
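The contrast between title-based and ID-based decoding can be illustrated with a short sketch. It assumes that a track embedding is discretized into a fixed-length code by residual quantization against a few small codebooks, which is one common way to build semantic IDs; this section does not state how the paper constructs its IDs, so the codebooks, the embeddings, and the function names below are all hypothetical.

```python
from typing import List, Sequence


def nearest(codebook: Sequence[Sequence[float]], vec: Sequence[float]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i], vec))


def semantic_id(vec: Sequence[float],
                codebooks: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Quantize vec level by level; each level encodes the previous residual.

    Assumption: residual quantization with fixed, hand-written codebooks.
    In practice the codebooks would be learned from track embeddings.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


# Hypothetical two-level codebooks over 2-dimensional embeddings.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0]],               # coarse level
    [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]],  # fine level
]

# Similar embeddings share the coarse code and differ only at the fine level.
id_a = semantic_id([1.1, 0.0], CODEBOOKS)
id_b = semantic_id([0.9, 0.0], CODEBOOKS)
id_c = semantic_id([0.0, 1.1], CODEBOOKS)
```

Every track receives an ID with exactly as many tokens as there are codebooks, so the number of decoding steps stays constant however long the track title is. Tracks with nearby embeddings also share ID prefixes, which is the sense in which such identifiers are "semantic".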
