Bottleneck-Minimal Indexing for Generative Document Retrieval

Xin Du; Lixin Xiu; Kumiko Tanaka-Ishii

TL;DR

This work reframes generative document retrieval (GDR) through information theory, treating the document index as a bottleneck that must carry information about both documents and queries. Applying rate-distortion theory and the information bottleneck principle, it derives Bottleneck-Minimal Indexing (BMI), which optimizes the index with respect to the query distribution, not just the document space. The authors show, both theoretically and empirically, that BMI, implemented via hierarchical k-means clustering on mean query vectors $\mu_{Q|x}$ computed from GenQ/RealQ/DocSeg, reduces information distortion and improves Recall@1 on NQ320K and MARCO Lite, often outperforming distortion-focused baselines and competing with state-of-the-art methods. The approach provides a simple, static indexing scheme with strong gains, especially for smaller models, and it demonstrates the practical value of incorporating query statistics into indexing for GDR.
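The indexing idea summarized above (hierarchical k-means over per-document mean query vectors) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `mean_query_vectors`, `kmeans_labels`, and `build_index`, the branching factor `k=4`, the leaf size, and the plain NumPy k-means are all assumptions, and the random vectors merely stand in for real query embeddings.

```python
import numpy as np


def mean_query_vectors(query_embs, doc_ids, num_docs):
    """Average the embeddings of all queries attached to each document."""
    sums = np.zeros((num_docs, query_embs.shape[1]))
    counts = np.zeros(num_docs)
    np.add.at(sums, doc_ids, query_embs)   # accumulate query vectors per document
    np.add.at(counts, doc_ids, 1.0)        # count queries per document
    return sums / np.maximum(counts, 1.0)[:, None]


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster label for every point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def build_index(vectors, k=4, doc_ids=None, prefix=()):
    """Recursively cluster documents; each ID is a tuple of tokens in [0, k)."""
    if doc_ids is None:
        doc_ids = np.arange(len(vectors))
    if len(doc_ids) <= k:
        # Small enough: the final token is the position inside the leaf.
        return {int(d): prefix + (pos,) for pos, d in enumerate(doc_ids)}
    labels = kmeans_labels(vectors[doc_ids], k)
    if np.bincount(labels, minlength=k).max() == len(doc_ids):
        # k-means failed to split (e.g. identical vectors): split arbitrarily.
        labels = np.arange(len(doc_ids)) % k
    index = {}
    for j in range(k):
        members = doc_ids[labels == j]
        if len(members):
            index.update(build_index(vectors, k, members, prefix + (j,)))
    return index


# Toy usage with random stand-in embeddings (hypothetical data).
rng = np.random.default_rng(0)
num_docs, queries_per_doc, dim = 50, 3, 8
doc_ids = np.repeat(np.arange(num_docs), queries_per_doc)
query_embs = rng.normal(size=(len(doc_ids), dim))
mu = mean_query_vectors(query_embs, doc_ids, num_docs)
index = build_index(mu, k=4)
print(index[0])  # the token sequence assigned to document 0
```

Clustering the mean query vectors rather than document embeddings is what distinguishes this sketch from a document-only hierarchical k-means index; everything else about the tree construction is the same.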

Abstract

We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a bottleneck in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
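
For orientation, the textbook information bottleneck objective, written here in the abstract's symbols, trades compression of the documents $X$ into indexes $T$ against the information that $T$ retains about the queries $Q$. The trade-off parameter $\beta$ and the exact form below are the standard ones from the information bottleneck literature and are stated as an assumption; the paper's own formulation may differ in detail.

$$\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Q), \qquad \beta > 0$$

A small $I(X;T)$ corresponds to a compact index, while a large $I(T;Q)$ means the index remains predictable from queries.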

Paper Structure (35 sections, 3 theorems, 19 equations, 7 figures, 4 tables)

Introduction
Related Works
Generative Document Retrieval
Discrete Representation Learning
Vector quantization (VQ).
Hash-based methods.
Rate-Distortion Theory
Bottleneck-Minimal Indexing
Mathematical Setting
Distortion Optimality
GDR Bottleneck
A Theoretical Solution to Optimality
Indexing Methods
Hierarchical Random Indexing (HRI)
Hierarchical $k$-Means Indexing (HKmI)
...and 20 more sections

Key Result

Proposition 1.1

Figures (7)

Figure 1: (a) Generative document retrieval (GDR) framework. (b) Our contribution using bottleneck-minimal indexing: (b-1) distortion-optimal indexing for documents $\mathcal{X}$; (b-2) optimal indexing for both documents and queries $\mathcal{Q}$.
Figure 2: Bottleneck curves
Figure 3: Experimental information bottleneck curves corresponding to Figure 2. (a) Empirical information curves for HKmI, measured with T5 models of different sizes. (b) Empirical information curves for different indexing methods, estimated on the NQ320K dataset with the T5-base model.
Figure 4: Rec@1 scores on the test set of NQ320K for document IDs generated by hierarchical $k$-means (blue) or random (red) clustering, for different sizes of the finetuned language model T5.
Figure 5: Performance enhancement by finetuning the docT5query model for generating improved GenQ documents.
...and 2 more figures

Theorems & Definitions (7)

Definition 3.1
Proposition 1.1
proof
Proposition 1.2
proof
Proposition 1.3
proof
