Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval

Enrico Palumbo, Gustavo Penha, Andreas Damianou, José Luis Redondo García, Timothy Christopher Heath, Alice Wang, Hugues Bouchard, Mounia Lalmas

TL;DR

The paper reframes prompt-based music recommendation as a generative retrieval task and introduces Text2Tracks, a decoder-based model that maps natural-language prompts directly to semantically rich track IDs. It systematically explores ID representations (content-based, integer-based, and learned semantic IDs) and finds that semantic IDs derived from collaborative-filtering embeddings offer the strongest performance, drastically reducing decoding steps. Text2Tracks with semantic IDs significantly outperforms both traditional retrieval baselines and other generative approaches (up to ~127% higher Hits@10), demonstrating the power of end-to-end generative retrieval for music recommendation. The work highlights practical benefits for conversational agents by enabling end-to-end generation of track identifiers, with potential extensions to joint text generation and broader media domains.

Abstract

In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach has several limitations. First, it is based on a general purpose tokenization that is optimized for words rather than for track titles. Second, it necessitates an additional entity resolution layer that matches the track title to the actual track identifier. Third, the number of decoding steps scales linearly with the length of the track title, slowing down inference. In this paper, we propose to address the task of prompt-based music recommendation as a generative retrieval task. Within this setting, we introduce novel, effective, and efficient representations of track identifiers that significantly outperform commonly used strategies. We introduce Text2Tracks, a generative retrieval model that learns a mapping from a user's music recommendation prompt to the relevant track IDs directly. Through an offline evaluation on a dataset of playlists with language inputs, we find that (1) the strategy to create IDs for music tracks is the most important factor for the effectiveness of Text2Tracks, and semantic IDs significantly outperform commonly used strategies that rely on song titles as identifiers; and (2) provided with the right choice of track identifiers, Text2Tracks outperforms sparse and dense retrieval solutions trained to retrieve tracks from language prompts.

Paper Structure

This paper contains 21 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) Pre-trained LLMs deal with prompt-based music recommendation by generating the recommended artist name and song title, which are then resolved against an index to find the actual track identifiers. (b) Text2Tracks is a generative track retrieval model composed of a component that represents tracks, i.e. the ID strategy $\phi$ that maps from a track to its ID, and a backbone LM that is fine-tuned with pairs of music recommendation queries and track IDs. At test time Text2Tracks generates a set of recommended tracks using a diversified beam search strategy.
  • Figure 2: The three categories of ID strategies using "_" as a separator. Content-based strategies use textual metadata associated with the item. Integer-based approaches assign a random integer value to each metadata field, potentially leveraging the hierarchy of metadata available. Learned approaches go from embeddings that represent the item to hierarchically structured tokens.
  • Figure 3: The effect on Hits@10 and on the diversity of the artists when increasing the homogeneity penalty hyperparameter, which applies a penalty for generating tokens that were selected in other beam search groups at prediction time with Text2Tracks.
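
The three ID categories described in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's implementation: the `Track` class, the function names, and the toy embeddings are hypothetical, and the learned strategy is replaced here by a simple recursive median split, used only to show how embeddings can become hierarchically structured tokens in which nearby items share a prefix.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical track record; `embedding` stands in for a learned item vector."""
    artist: str
    title: str
    embedding: tuple


def content_id(track):
    """Content-based ID: textual metadata joined with '_'."""
    return f"{track.artist}_{track.title}"


def build_integer_ids(tracks, seed=0):
    """Integer-based ID: a random integer per artist, then a per-artist track counter.

    Tracks by the same artist share the first token, which reflects the
    artist -> track metadata hierarchy.
    """
    rng = random.Random(seed)
    artists = sorted({t.artist for t in tracks})
    codes = list(range(len(artists)))
    rng.shuffle(codes)
    artist_code = dict(zip(artists, codes))
    counters = {}
    ids = {}
    for t in tracks:
        n = counters.get(t.artist, 0)
        counters[t.artist] = n + 1
        ids[t] = f"{artist_code[t.artist]}_{n}"
    return ids


def build_semantic_ids(tracks, depth=2):
    """Toy learned-style ID: recursive median split of the embedding space.

    Assumption: this is a stand-in for a learned quantizer. Each level emits
    token 0 or 1, so tracks with similar embeddings share ID prefixes.
    IDs are only unique if `depth` is large enough for the catalogue.
    """
    tokens = {t: [] for t in tracks}

    def split(group, level):
        if level == depth or len(group) <= 1:
            return
        dim = level % len(group[0].embedding)
        ordered = sorted(group, key=lambda t: t.embedding[dim])
        mid = len(ordered) // 2
        for token, half in enumerate((ordered[:mid], ordered[mid:])):
            for t in half:
                tokens[t].append(str(token))
            split(half, level + 1)

    split(list(tracks), 0)
    return {t: "_".join(parts) for t, parts in tokens.items()}
```

In this sketch the content-based ID grows with the length of the artist name and title, while the integer and semantic IDs have a fixed, small number of tokens per track, which is the property the abstract links to fewer decoding steps.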