How Does Generative Retrieval Scale to Millions of Passages?

Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran

TL;DR

This work provides the first large-scale empirical evaluation of generative retrieval on corpora of up to 8.8 million passages (MS MARCO FULL), systematically ablating inputs, targets, and model variants. It finds synthetic query generation to be the central driver of effectiveness at scale, while many architectural tweaks offer limited gains once compute costs are considered. Increasing model size helps only under certain configurations and can even hurt at very large parameter budgets, especially for sequential identifiers. This suggests that naive, synthetic-query-driven indexing with reasonably scaled models currently yields the best practical performance. Overall, the results indicate that while generative retrieval can approach dense dual encoders on small corpora, achieving competitive, scalable performance on millions of passages requires fundamental advances beyond simple scaling.

Abstract

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.

Paper Structure

This paper contains 36 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Jaccard similarity between synthetic queries and validation set queries vs. MRR@10 on the MSMarco100K subset.
  • Figure 2: MSMarco100K MRR@10 as we vary the number of synthetic queries per passage. Given 100 pre-generated queries per passage, we compare random-k sampling, top-k selection via RankT5-XL, and using all 100 synthetic queries.
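
The two quantities named in the figure captions, Jaccard similarity and MRR@10, can be illustrated with a minimal sketch. The function names and the lowercase whitespace tokenization below are assumptions made for illustration; the captions do not state how the paper tokenizes queries or aggregates the similarity, so this should not be read as the authors' exact procedure.

```python
from typing import Sequence, Set


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty strings
    return len(set_a & set_b) / len(set_a | set_b)


def mrr_at_k(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant id in each top-k ranking.

    A query with no relevant id in its top k contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical toy data, not taken from the paper.
sim = jaccard("what is dsi", "what is generative retrieval")
score = mrr_at_k([["d1", "d2"], ["d3", "d4"]], [{"d2"}, {"d9"}])
```

On this toy data, `sim` is 0.4 (two shared tokens out of five distinct ones) and `score` is 0.25 (reciprocal ranks of 1/2 and 0, averaged over two queries).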