ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval

Weiwei Sun, Keyi Kong, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren, Yiming Yang

TL;DR

ZeroGR addresses the challenge of zero-shot generalization in generative retrieval by introducing an instruction-driven framework that unifies docid design, corpus indexing, and decoding. It deploys a model-based docid generator, an instruction-tuned pseudo-query generator, and a reverse annealing decoding strategy to build task-specific generative indices without supervision, guided by natural language instructions. The approach is trained on a large, diversified IR corpus and evaluated on BEIR and MAIR, where it achieves state-of-the-art zero-shot performance and demonstrates strong cross-domain transfer with a relatively small model. The work advances practical IR by enabling robust, instruction-guided generative retrieval across heterogeneous data sources and tasks, with scalable gains from instruction-tuning and model size.

Abstract

Generative retrieval (GR) reformulates information retrieval (IR) by framing it as the generation of document identifiers (docids), thereby enabling end-to-end optimization and seamless integration with generative language models (LMs). Despite notable progress under supervised training, GR still struggles to generalize to zero-shot IR scenarios, which are prevalent in real-world applications. To tackle this challenge, we propose ZeroGR, a zero-shot generative retrieval framework that leverages natural language instructions to extend GR across a wide range of IR tasks. Specifically, ZeroGR is composed of three key components: (i) an LM-based docid generator that unifies heterogeneous documents (e.g., text, tables, code) into semantically meaningful docids; (ii) an instruction-tuned query generator that generates diverse types of queries from natural language task descriptions to enhance corpus indexing; and (iii) a reverse annealing decoding strategy to balance precision and recall during docid generation. We investigate the impact of instruction fine-tuning scale and find that performance consistently improves as the number of IR tasks encountered during training increases. Empirical results on the BEIR and MAIR benchmarks demonstrate that ZeroGR outperforms strong dense retrieval and generative baselines in zero-shot settings, establishing a new state-of-the-art for instruction-driven GR.
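
To make the idea behind component (iii) concrete, here is a minimal Python sketch of a reverse-annealed temperature schedule. It is an illustration under stated assumptions, not the authors' implementation: it assumes the temperature rises linearly across successive sampling rounds, and it treats each docid as a single choice with a fixed score instead of a token sequence decoded by an LM. The function names, the default temperatures (0.2 and 1.0), and the toy scores are all hypothetical.

```python
import math
import random


def reverse_annealed_temperatures(num_samples, t_start=0.2, t_end=1.0):
    """Return temperatures that rise linearly from t_start to t_end.

    Assumption: "reverse annealing" is modelled here as a schedule that
    starts cold (peaked, precision-oriented) and ends hot (flatter,
    recall-oriented). The linear shape is illustrative only.
    """
    if num_samples == 1:
        return [t_start]
    step = (t_end - t_start) / (num_samples - 1)
    return [t_start + i * step for i in range(num_samples)]


def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax over logits scaled by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_docids(scores_by_docid, num_samples, seed=0):
    """Draw one docid per round, raising the temperature each round.

    Early rounds favour the highest-scoring docid; later rounds spread
    probability mass so that other docids can enter the ranked list.
    Duplicates are skipped, so the result has at most num_samples items.
    """
    rng = random.Random(seed)
    docids = list(scores_by_docid)
    logits = [scores_by_docid[d] for d in docids]
    ranked = []
    for temperature in reverse_annealed_temperatures(num_samples):
        probs = softmax_with_temperature(logits, temperature)
        choice = rng.choices(docids, weights=probs, k=1)[0]
        if choice not in ranked:
            ranked.append(choice)
    return ranked


if __name__ == "__main__":
    toy_scores = {"doc-a": 3.0, "doc-b": 1.0, "doc-c": 0.5}  # hypothetical
    print(reverse_annealed_temperatures(5))
    print(sample_docids(toy_scores, num_samples=8))
```

In this toy setup, the low initial temperature concentrates probability on the best-scoring docid, while the higher later temperatures make it more likely that additional distinct docids are sampled, which is one way to read the stated precision and recall trade-off.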

Paper Structure

This paper contains 28 sections, 6 equations, 7 figures, 5 tables, and 1 algorithm.

Figures (7)

  • Figure 1: An overview of ZeroGR. Given a document collection, ZeroGR converts the documents into unified docid representations, generates diverse pseudo-queries, and builds a generative retrieval index. During online retrieval, ZeroGR decodes docids with reverse-annealed temperature scheduling to balance precision and recall.
  • Figure 2: Performance (nDCG@10) of different generative retrieval models across various datasets on BEIR.
  • Figure 3: Performance (Acc@1) on the unseen subset of MAIR.
  • Figure 4: Model performance on unseen-dev tasks as a function of the number of instruction-tuning tasks. We gradually increase the number of instruction-tuning tasks, starting from MS MARCO, and incrementally add open-domain QA datasets (e.g., NQ), BEIR-Train sets (e.g., NFC), MTEB-Train data (e.g., NLI), and finally the ZeroGR-Train collection, which includes 60 tasks across 6 domains. Left: More instruction-tuning tasks lead to more diverse queries. Middle: More instruction-tuning tasks reduce docid conflicts. Right: More instruction-tuning tasks improve the Acc@1 score.
  • Figure 5: Left: Comparison of different docid designs. Middle: Acc@1 vs. generated queries per document. Right: Acc@1 vs. model size.
  • ...and 2 more figures
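
The figure captions above report nDCG@10 and Acc@1. As a reading aid, the following Python sketch computes both metrics for a single query. It assumes the common linear-gain form of nDCG with a log2 position discount; the source does not state which evaluation toolkit or gain variant was used, and the function names and toy data are hypothetical.

```python
import math


def acc_at_1(ranked_docids, relevant_docids):
    """Return 1.0 if the top-ranked docid is relevant, else 0.0."""
    if ranked_docids and ranked_docids[0] in relevant_docids:
        return 1.0
    return 0.0


def ndcg_at_k(ranked_docids, gains, k=10):
    """nDCG@k with linear gains and a log2(rank + 1) discount.

    gains maps docid -> graded relevance; docids missing from the
    mapping are treated as non-relevant (gain 0).
    """
    dcg = sum(
        gains.get(docid, 0) / math.log2(position + 2)
        for position, docid in enumerate(ranked_docids[:k])
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(position + 2)
        for position, gain in enumerate(ideal_gains)
    )
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["doc-b", "doc-a", "doc-c"]  # hypothetical system output
    judgments = {"doc-a": 1}               # hypothetical relevance labels
    print(acc_at_1(ranking, set(judgments)))
    print(ndcg_at_k(ranking, judgments, k=10))
```

Benchmark-level scores are typically obtained by averaging these per-query values over all queries in a dataset.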