QUESTER: Query Specification for Generative Retrieval

Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski

TL;DR

QueStER reframes Generative Retrieval by learning to produce keyword-based query specifications that are processed by a BM25 engine, addressing scaling and generalization challenges of DocID-based GR approaches. It trains a small LLM with Group-Relative Policy Optimization (GRPO) using a SoftRank-based reward and cross-encoder distillation to guide learning, enabling effective and efficient retrieval. Empirical results on MS MARCO and BEIR show QueStER outperforming BM25 in both in-domain and out-of-domain settings, with a favorable latency around 28 ms per query when using a 4B backbone. By leveraging established search technologies and providing interpretable query specifications, QueStER offers a scalable alternative to large dense or generative IR models and sets a foundation for future exploration of structured query languages and hybrid backends.

Abstract

Generative Retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and directly generating document identifiers. However, GR often struggles to generalize and is costly to scale. We introduce QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation - in this work, a simple keyword query handled by BM25 - using a (small) LLM. The policy is trained using reinforcement learning techniques (GRPO). Across in- and out-of-domain evaluations, we show that our model is more effective than BM25 and competitive with neural IR models, while maintaining good efficiency.

Paper Structure

This paper contains 32 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of our query rewriting framework. An LLM generates multiple candidate queries, which retrieve documents using efficient index-based bag-of-words (BoW) IR models. The top-$k$ retrieved results are annotated with reference scores from a cross-encoder, from which SoftNDCG, the expectation of nDCG, can be computed. The resulting rewards are then used to update the policy for improved reformulation.
  • Figure 2: Trade-off between efficiency (ms/query, lower is better) and effectiveness (nDCG@10 on DL19, higher is better) for retrieval (generation time is not reported). Bubble size indicates model size (billions of parameters). Our QueStER offers a favorable balance, approaching MuGI and LameR in quality while being 4 to 7$\times$ faster.
  • Figure 3: Keyword overlap distribution between original queries (blue), generated queries (yellow), and ground-truth relevant passages (brown). Generated queries consistently inject discriminative vocabulary that aligns with relevant documents, both in-domain (ID) and out-of-domain (OOD).
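
The reward loop described in the Figure 1 caption can be sketched roughly as follows: each candidate keyword query is run through a BM25 scorer, and the resulting ranking is scored against reference relevance values. This is a toy sketch under stated assumptions, not the paper's implementation. It uses a hand-written Okapi BM25 with the common defaults `k1=1.2` and `b=0.75`, and plain nDCG as a stand-in for SoftNDCG. The documents, the `gains` values (standing in for cross-encoder scores), and the function names are all hypothetical.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return indices of matching documents, best first, using plain Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: -scores[i])
    # Like an inverted index, only return documents that match at least one term
    return [i for i in ranked if scores[i] > 0]

def ndcg_at_k(ranking, gains, k=10):
    """Plain nDCG@k against reference gains (stand-in for SoftNDCG, an assumption)."""
    dcg = sum(gains[d] / math.log2(r + 2) for r, d in enumerate(ranking[:k]))
    ideal = sorted(gains, reverse=True)[:k]
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical toy corpus and reference relevance scores
docs = [
    "bm25 is a bag of words ranking function used by search engines",
    "the cat sat on the mat",
    "generative retrieval models generate document identifiers directly",
]
gains = [1.0, 0.0, 0.3]
candidates = ["ranking function search", "cat mat"]
rewards = [ndcg_at_k(bm25_rank(q, docs), gains) for q in candidates]
```

The first candidate retrieves the most relevant document and earns a positive reward, while the second retrieves only an irrelevant one and earns zero. Rewards of this kind, computed per candidate, are what the policy update would consume.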