Does Generative Retrieval Overcome the Limitations of Dense Retrieval?
Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
TL;DR
This work contrasts generative retrieval (GR) with dense retrieval (DR) along learning objectives and representational capacity, showing that GR uses globally normalized likelihood while DR relies on local normalization, and that GR can memorize a corpus via its full parameterization rather than compressing relevance into low-dimensional embeddings. The authors show theoretically that DR suffers from optimization drift that grows with corpus size because of local normalization, and from a low-rank bottleneck in bilinear scoring, whereas GR can, in principle, approximate the true query–document posterior mapping given sufficient capacity and a sensible docid scheme. Empirically, on Natural Questions and MS MARCO, GR scales favorably (larger gains with larger models and corpora) but does not universally outperform DR in practice; its performance remains contingent on docid design, decoding constraints, and data quality. The paper concludes with practical directions to close the gap, including pretraining relevance targets, coarse-to-fine architectures, and hybrid systems that combine GR's generative strengths with DR's efficiency, with the aim of robust, scalable generative retrieval.
Abstract
Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR's local normalization. Second, with larger models, GR's representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. Despite its theoretical advantages, however, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR's theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
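
The contrast between locally and globally normalized objectives can be illustrated with a minimal sketch. It is not the paper's formulation: it assumes a flat softmax over per-document scores as a stand-in for GR's likelihood over document identifiers, and a sampled-negative softmax as a stand-in for DR's contrastive loss. All function names (`global_nll`, `local_nll`, `normalization_gap`) and the random scores are hypothetical and chosen for illustration only.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def global_nll(scores, positive):
    """Negative log-likelihood normalized over every document in the corpus."""
    return log_sum_exp(scores) - scores[positive]


def local_nll(scores, positive, negatives):
    """Negative log-likelihood normalized over the positive and sampled negatives only."""
    candidates = [scores[positive]] + [scores[i] for i in negatives]
    return log_sum_exp(candidates) - scores[positive]


def normalization_gap(corpus_size, num_negatives, seed=0):
    """Global minus local loss for one query with fixed (assumed random) scores."""
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(corpus_size)]
    positive = 0  # assumption: document 0 is the relevant one
    negatives = rng.sample(range(1, corpus_size), num_negatives)
    return global_nll(scores, positive) - local_nll(scores, positive, negatives)


if __name__ == "__main__":
    # Same number of sampled negatives, increasingly large corpora.
    for size in (100, 1_000, 10_000):
        print(size, round(normalization_gap(size, num_negatives=15), 3))
```

Because the local denominator sums over a subset of the terms in the global one, the local loss can never exceed the global loss for the same scores, and adding documents to the corpus while the sampled negatives stay fixed can only widen the difference. This is one simple way to see why an objective normalized over sampled negatives may diverge from the full-corpus objective as the corpus grows; the paper's own analysis of optimization drift may be stated differently.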
