Efficient, Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion

Pengcheng Jiang, Judith Yue Li, Moonkyung Ryu, R. Lily Hu, Kun Su, Zhong Yi Wan, Liam Hebert, Hao Peng, Jiawei Han, Dima Kuzmin, Craig Boutilier

TL;DR

R4T (Retrieve-for-Train) uses RL once, as an objective transducer in a three-step process, and improves retrieval quality relative to strong baselines while reducing query-time fan-out latency by an order of magnitude.

Abstract

Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while remaining grounded with respect to a fixed database. Set-valued objectives are typically non-decomposable and are not captured by existing supervised (query, content) datasets which only prioritize top-1 retrieval. Consequently, fan-out retrieval is often employed to generate diverse subqueries to retrieve item sets. While reinforcement learning (RL) can optimize set-level objectives via interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at inference time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address these issues, we propose R4T (Retrieve-for-Train), which uses RL once as an objective transducer in a three-step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across large-scale fashion and music benchmarks consisting of curated item sets, we show that R4T improves retrieval quality relative to strong baselines while reducing query-time fan-out latency by an order of magnitude.
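
To illustrate why set-valued objectives are called non-decomposable, here is a minimal sketch of a composite set-level reward. The function `set_reward`, the `diversity_weight` parameter, and the specific combination of mean relevance with mean pairwise dissimilarity are assumptions made for illustration; the abstract does not specify the rewards R4T actually uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def set_reward(query, items, diversity_weight=0.5):
    """Hypothetical composite reward (an assumption, not the paper's reward):
    mean relevance to the query plus weighted mean pairwise dissimilarity."""
    if not items:
        return 0.0
    relevance = sum(cosine(query, x) for x in items) / len(items)
    pairs = list(combinations(items, 2))
    # The diversity term depends on pairs of items, so it cannot be
    # written as a sum of independent per-item scores.
    diversity = (
        sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    )
    return relevance + diversity_weight * diversity


query = [1.0, 0.0]
duplicates = [[0.8, 0.6], [0.8, 0.6]]   # two copies of the same item
varied = [[0.8, 0.6], [0.8, -0.6]]      # equally relevant, but different

print(set_reward(query, duplicates))
print(set_reward(query, varied))
```

Every item in both sets has the same relevance to the query, so a per-item (top-1 style) score cannot tell the two sets apart. Only the pairwise term separates them, which is the kind of signal that (query, content) supervision alone does not carry.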

Paper Structure (31 sections, 9 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of R4T. Step 1 trains a fan-out language model (FOLM) using RL to produce property-aligned sub-queries. Step 2 uses the trained FOLM to synthesize $(q,c)$ supervision data. Step 3 trains a diffusion-based fan-out retriever that samples content embeddings directly from query embeddings.
  • Figure 2: Illustration of the rewards used in open-ended abstract retrieval (OAR).
  • Figure 3: Qualitative comparison on the Polyvore dataset for the open-ended abstract retrieval task given the broad query "Bohemian festival style". R4T generates semantically distinct, on-topic sub-queries that retrieve diverse outfit collections, while the Qwen3-4B zero-shot baseline produces largely paraphrastic sub-queries, leading to more homogeneous results.
  • Figure 4: Reward & Scores over Time (FOLM Training) for Open-Ended Abstract Retrieval.
  • Figure 5: Efficiency comparison between Autoregressive LLM and Diffusion Model for query fan-out ($k=10$) generation. Note that the x-axis follows a logarithmic scale (doubling at each step). The Diffusion Model demonstrates superior scalability and lower latency, maintaining sub-second performance for small batches and achieving an order-of-magnitude speedup at larger batch sizes.
  • ...and 1 more figure
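
The Figure 1 caption describes a retriever that samples content embeddings directly from a query embedding, and the abstract requires results to stay grounded in a fixed database. The sketch below shows only that grounding step, under stated assumptions: `sample_fanout` is a placeholder that adds Gaussian noise to the query embedding and is not a diffusion model, while `ground`, the toy `database`, and its item names are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_fanout(query_emb, k, noise=0.3, seed=0):
    """Placeholder sampler (assumption): perturbs the query embedding with
    Gaussian noise. A trained diffusion retriever would replace this."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, noise) for x in query_emb] for _ in range(k)]


def ground(samples, database):
    """Map each sampled embedding to its nearest database item by cosine
    similarity, keeping first occurrences only."""
    retrieved = []
    for s in samples:
        best = max(database, key=lambda name: cosine(s, database[name]))
        if best not in retrieved:
            retrieved.append(best)
    return retrieved


# Hypothetical 2-D item embeddings, for illustration only.
database = {
    "boho_dress": [0.9, 0.4],
    "fringe_boots": [0.7, 0.7],
    "flower_crown": [0.4, 0.9],
    "office_suit": [-0.9, 0.1],
}

query_emb = [0.7, 0.7]
items = ground(sample_fanout(query_emb, k=10), database)
print(items)
```

Because every sample is snapped to an existing item, the returned set can never contain content outside the database, whatever the sampler produces. The fan-out size `k=10` mirrors the setting mentioned in the Figure 5 caption.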