Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval with Semantic Guidance

Xuchan Bao, Judith Yue Li, Zhong Yi Wan, Kun Su, Timo Denk, Joonseok Lee, Dima Kuzmin, Fei Sha

TL;DR

Diff4Steer addresses the rigidity of deterministic seed embeddings in music retrieval by introducing a diffusion-based prior that generates a distribution of seed embeddings conditioned on cross-modal queries. It uses classifier-free guidance and optional text steering to produce diverse yet semantically aligned seeds, improving both embedding quality and retrieval performance. Evaluation on image-to-music and text-to-music tasks (YT8M, MusicCaps, MelBench) shows competitive retrieval metrics and notably higher diversity than deterministic and multi-modal baselines, while offering a lighter-weight alternative to large foundation models. The approach enables flexible, personalized music discovery; the authors acknowledge its computational cost and data-bias concerns, and outline directions for real-time scalability and bias mitigation.

Abstract

Modern music retrieval systems often rely on fixed representations of user preferences, limiting their ability to capture users' diverse and uncertain retrieval needs. To address this limitation, we introduce Diff4Steer, a novel generative retrieval framework that employs lightweight diffusion models to synthesize diverse seed embeddings from user queries that represent potential directions for music exploration. Unlike deterministic methods that map a user query to a single point in embedding space, Diff4Steer provides a statistical prior on the target modality (audio) for retrieval, effectively capturing the uncertainty and multi-faceted nature of user preferences. Furthermore, Diff4Steer can be steered by image or text inputs, enabling more flexible and controllable music discovery when combined with nearest neighbor search. Our framework outperforms deterministic regression methods and an LLM-based generative retrieval baseline in terms of retrieval and ranking metrics, demonstrating its effectiveness in capturing user preferences and leading to more diverse and relevant recommendations. Listening examples are available at tinyurl.com/diff4steer.
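
The two retrieval-side ingredients named above, classifier-free guidance and nearest neighbor search over seed embeddings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `cfg_combine` and `nearest_neighbors` are hypothetical, the guidance rule shown is one common convention (the paper may parameterize it differently), and cosine similarity is assumed as the retrieval metric.

```python
import numpy as np


def cfg_combine(pred_cond, pred_uncond, guidance_strength):
    """Classifier-free guidance (one common convention, assumed here).

    Moves the denoiser output from the unconditional prediction toward the
    conditional one. A strength of 0 gives the unconditional prediction,
    1 gives the conditional prediction, and values above 1 extrapolate.
    """
    return pred_uncond + guidance_strength * (pred_cond - pred_uncond)


def nearest_neighbors(seeds, catalog, k=1):
    """Return, for each seed embedding, the indices of the k catalog
    embeddings with the highest cosine similarity (assumed metric)."""
    seeds_n = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
    catalog_n = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    sims = seeds_n @ catalog_n.T  # shape: (num_seeds, num_items)
    return np.argsort(-sims, axis=1)[:, :k]
```

In a generative retrieval setting, several seed embeddings sampled for the same query would each be passed to `nearest_neighbors`, so the retrieved set reflects the spread of the sampled distribution rather than a single point.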

Paper Structure

This paper contains 30 sections, 20 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overall diagram of our generative retrieval framework for cross-modal music retrieval, with comparison to the regression and multi-modal LLM baselines.
  • Figure 2: Given an input image and various guidance strengths (GS), we generate seed embeddings and retrieve their nearest music pieces in MelBench (MB). We show the entropy and the probabilities of the top-3 genres. A higher entropy indicates more diverse genres among the retrieved music pieces.
  • Figure 3: Overall architecture for the diffusion backbone.
  • Figure 4: Architecture diagram for the ResNet blocks.
  • Figure 5: An example questionnaire used for human study.
  • ...and 5 more figures
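
The diversity measure described in the Figure 2 caption, the entropy of the genre distribution over retrieved pieces together with the top-3 genre probabilities, can be sketched as follows. This is an illustrative assumption of how such a statistic might be computed: the function name `genre_stats` is hypothetical, and the natural logarithm (entropy in nats) is assumed because the caption does not state a base.

```python
import math
from collections import Counter


def genre_stats(genres, top_k=3):
    """Compute the entropy (in nats, assumed) of the empirical genre
    distribution and the probabilities of the top_k most frequent genres.

    genres: list of genre labels, one per retrieved music piece.
    """
    counts = Counter(genres)
    total = len(genres)
    probs = {g: c / total for g, c in counts.items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    top = [(g, c / total) for g, c in counts.most_common(top_k)]
    return entropy, top
```

If every retrieved piece shares one genre the entropy is zero, and it grows as the retrieved pieces spread more evenly across genres, which matches the caption's reading that higher entropy indicates more diverse retrievals.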