Does Generative Retrieval Overcome the Limitations of Dense Retrieval?

Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

TL;DR

This work contrasts generative retrieval (GR) with dense retrieval (DR) along learning objectives and representational capacity, showing that GR uses globally normalized likelihood while DR relies on local normalization, and that GR can memorize a corpus via its full parameterization rather than compressing relevance into low-dimensional embeddings. The authors prove that DR suffers calibration drift that grows with corpus size due to local normalization and a low-rank bottleneck in bilinear scoring, whereas GR can, in principle, approximate the true query–document posterior mapping given sufficient capacity and a sensible docid scheme. Empirically on Natural Questions and MS MARCO, GR demonstrates scaling-friendly behavior (better gains with larger models and corpora) but does not universally outperform DR in practice, with performance still contingent on docid design, decoding constraints, and data quality. The paper concludes with practical directions to close the gap, including pretraining relevance targets, coarse-to-fine architectures, and hybrid systems that combine GR's generative strengths with DR's efficiency, aiming for robust, scalable generative retrieval solutions.

Abstract

Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR's local normalization. Second, with larger models, GR's representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. But despite its theoretical advantages, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR's theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
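
The low-rank structure mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest form of bilinear scoring, a plain dot product between query and document embeddings; the random embeddings, the toy sizes, and the helper name `bilinear_scores` are hypothetical and do not come from the paper. Whatever the number of queries and documents, the resulting score matrix has rank at most the embedding dimension.

```python
import numpy as np

def bilinear_scores(query_emb, doc_emb):
    """Score every query against every document with a dot product."""
    return query_emb @ doc_emb.T

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (not the paper's settings).
num_queries, num_docs, dim = 50, 200, 8

# Random vectors stand in for the outputs of trained encoders.
query_emb = rng.standard_normal((num_queries, dim))
doc_emb = rng.standard_normal((num_docs, dim))

scores = bilinear_scores(query_emb, doc_emb)

# The 50 x 200 score matrix factors through a dim-dimensional space,
# so its rank cannot exceed dim.
score_rank = int(np.linalg.matrix_rank(scores))
print(f"score matrix shape: {scores.shape}, rank: {score_rank}, dim: {dim}")
```

Increasing `num_queries` or `num_docs` leaves the rank bound unchanged; only a larger `dim` raises it, which is consistent with the paper's use of embedding dimension as an experimental variable.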

Paper Structure

This paper contains 25 sections, 4 theorems, 16 equations, 5 figures, 3 tables.

Key Result

Theorem 3.1

Let $\widetilde{P}_\Theta(d\mid q)$ be the full-softmax distribution. Under the assumptions above, the expected gap satisfies the following condition: where $N=\lvert\mathcal{D}\rvert$ and $K$ is the batch candidate size.
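
The two quantities this theorem compares can be made concrete with a small sketch. It is not the theorem's bound, only an illustration of the setup under stated assumptions: scores are random numbers rather than model outputs, $K$ is taken to count the positive plus $K-1$ uniformly sampled negatives, and the function names are hypothetical. Because the local denominator sums over a subset of the corpus, the locally normalized probability of the positive document is never smaller than its full-softmax probability.

```python
import numpy as np

def full_softmax_prob(scores, positive_idx):
    """Probability of the positive under a softmax over all N documents."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(probs[positive_idx])

def local_softmax_prob(scores, positive_idx, negative_idx):
    """Probability of the positive under a softmax over K candidates only."""
    candidates = np.concatenate(([positive_idx], negative_idx))
    cand_scores = scores[candidates]
    shifted = cand_scores - cand_scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return float(probs[0])  # the positive sits at position 0

rng = np.random.default_rng(0)

# Hypothetical sizes: N documents in the corpus, K candidates per query.
N, K = 10_000, 32
scores = rng.standard_normal(N)  # stand-in for query-document scores
positive_idx = 0
negative_idx = rng.choice(np.arange(1, N), size=K - 1, replace=False)

p_full = full_softmax_prob(scores, positive_idx)
p_local = local_softmax_prob(scores, positive_idx, negative_idx)
print(f"full softmax: {p_full:.6f}  local softmax: {p_local:.6f}")
```

Holding `K` fixed while raising `N` adds terms only to the full-softmax denominator, so the two probabilities move further apart as the corpus grows.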

Figures (5)

  • Figure 1: DR's retrieval performance improves as the number of negative samples increases. The left $y$-axis shows retrieval metrics (higher is better), while the right $y$-axis shows the Brier score (lower is better). The plotted Brier values are raw and thus not comparable across different settings.
  • Figure 2: DR's retrieval performance improves as the embedding dimension increases.
  • Figure 3: Comparison of DR and GR under synchronized model scaling. Only the increasing range is shown here. All models drop after 0.4B due to adding too many new parameters. See the model scaling appendix for the full curve.
  • Figure 4: Extended results of corpus scaling.
  • Figure 5: Extended results of model scaling.

Theorems & Definitions (4)

  • Theorem 3.1: Lower bound under local normalization
  • Proposition 3.2: Global normalization and calibration of GR
  • Corollary 3.3: Low-rank bottleneck of bilinear DR
  • Theorem 3.4: Approximation of $P^\star$ by GR
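
The global normalization named in Proposition 3.2 can be sketched with a toy autoregressive model over docids. This is an illustration under loud assumptions, not the paper's model: a table of random logits per prefix stands in for a trained decoder, every token sequence of the fixed length is treated as a valid docid, and all names are hypothetical. Since each decoding step is a softmax over the vocabulary, the product of step probabilities defines a distribution that sums to one over the whole docid space.

```python
import itertools
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def docid_prob(docid, prefix_logits):
    """Probability of a docid as a product of per-step softmax probabilities."""
    prob = 1.0
    for step in range(len(docid)):
        prefix = docid[:step]  # tokens generated so far
        prob *= softmax(prefix_logits[prefix])[docid[step]]
    return float(prob)

rng = np.random.default_rng(0)

# Hypothetical docid scheme: fixed-length sequences over a tiny vocabulary.
vocab_size, docid_len = 4, 3

# One random logit vector per prefix (assumption: replaces a trained decoder).
prefix_logits = {
    prefix: rng.standard_normal(vocab_size)
    for step in range(docid_len)
    for prefix in itertools.product(range(vocab_size), repeat=step)
}

all_docids = list(itertools.product(range(vocab_size), repeat=docid_len))
total = sum(docid_prob(docid, prefix_logits) for docid in all_docids)
print(f"{len(all_docids)} docids, total probability = {total:.12f}")
```

With a real corpus only some token sequences are valid docids, so constrained decoding would have to renormalize over the valid continuations at each step; the sketch leaves that case out.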