Replication and Exploration of Generative Retrieval over Dynamic Corpora

Zhen Zhang, Xinyu Ma, Weiwei Sun, Pengjie Ren, Zhumin Chen, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren

TL;DR

This work investigates how generative retrieval (GR) models perform when document collections evolve over time. It systematically reproduces a range of GR approaches and reveals that text-based docids generalize better to unseen documents in dynamic corpora, while numeric-based docids tend to overfit to the initial corpus. Building on these insights, the authors propose MDGR, a multi-docid numeric framework that preserves efficiency yet mitigates semantic drift through constrained docid expansion and a multi-chunk design, achieving competitive performance without retraining. The findings highlight the importance of docid semantics and granularity for robust dynamic retrieval and offer practical guidelines for deploying GR in real-world, ever-changing corpora.

Abstract

Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated on a static document collection, and their performance over dynamic corpora, where document collections evolve continuously, is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with text-based docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with numeric-based docids are highly efficient, their performance drops significantly over dynamic corpora. Furthermore, our experiments find that the underperformance of numeric-based docids is partly due to an excessive bias toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) semantic alignment with language models' pretrained knowledge, (ii) fine-grained docid design, and (iii) high lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance over dynamic corpora without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.

Paper Structure

This paper contains 19 sections, 8 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: IDBI results for different retrieval methods on the NQ dataset. Lower is better.
  • Figure 2: Hit@10 performance of n-gram docid and three fixed-position docid approaches on the NQ dataset.
  • Figure 3: Hit@10 performance for different decoding dimensions of Ultron-PQ on the NQ dataset.
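
Figures 2 and 3 report Hit@10. As a minimal sketch, assuming the common definition of Hit@k (the fraction of queries whose top-k ranked list contains at least one relevant document), the metric could be computed as below. The function name `hit_at_k` and the dictionary-based input layout are illustrative assumptions, not taken from the paper, and the authors' evaluation script may differ in detail.

```python
from typing import Dict, List, Set


def hit_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant docid in the top k.

    rankings: query id -> docids ordered from best to worst (hypothetical layout).
    relevant: query id -> set of relevant docids (hypothetical layout).
    """
    if not rankings:
        return 0.0  # convention chosen here for an empty query set
    hits = 0
    for qid, ranked_docids in rankings.items():
        gold = relevant.get(qid, set())
        # A query counts as a hit if any of its top-k docids is relevant.
        if any(docid in gold for docid in ranked_docids[:k]):
            hits += 1
    return hits / len(rankings)


# Toy usage: q1 finds its relevant document at rank 2, q2 never does.
toy_rankings = {"q1": ["d3", "d1"], "q2": ["d9"]}
toy_relevant = {"q1": {"d1"}, "q2": {"d2"}}
score_at_10 = hit_at_k(toy_rankings, toy_relevant, k=10)
score_at_1 = hit_at_k(toy_rankings, toy_relevant, k=1)
```

In a dynamic-corpus setting, the same function could be applied separately to queries targeting initial documents and to queries targeting newly added documents, so the two groups can be compared directly.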

Theorems & Definitions (3)

  • Definition 1: Forgetting Metric $F_n$
  • Definition 2: Generalization Performance $GA_n$
  • Definition 3: Initial Document Bias Index (IDBI)