
Bottleneck-Minimal Indexing for Generative Document Retrieval

Xin Du, Lixin Xiu, Kumiko Tanaka-Ishii

TL;DR

This work reframes generative document retrieval (GDR) through information theory, treating the document index as a bottleneck that must carry information about both documents and queries. By applying rate-distortion theory and the information bottleneck, it derives Bottleneck-Minimal Indexing (BMI), which optimizes the index with respect to the query distribution rather than the document space alone. The authors show, both theoretically and empirically, that BMI reduces information distortion and improves Recall@1 on NQ320K and MARCO Lite, often outperforming distortion-focused baselines and remaining competitive with state-of-the-art methods. BMI is implemented via hierarchical k-means clustering on mean query vectors $\mu_{Q|x}$ computed from GenQ/RealQ/DocSeg. The approach provides a simple, static indexing scheme with strong gains, especially for smaller models, and demonstrates the practical value of incorporating query statistics into indexing for GDR.
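The indexing step named above (hierarchical k-means over mean query vectors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans`, `mean_query_vectors`, and `hierarchical_ids`, the parameters `k` and `leaf_size`, and the random stand-in embeddings are all hypothetical, and a real pipeline would use query embeddings from an encoder and a library clusterer.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def mean_query_vectors(query_embs, query_doc, n_docs):
    """Average the embeddings of all queries attached to each document."""
    sums = np.zeros((n_docs, query_embs.shape[1]))
    counts = np.zeros(n_docs)
    np.add.at(sums, query_doc, query_embs)   # unbuffered scatter-add per document
    np.add.at(counts, query_doc, 1)
    return sums / np.maximum(counts, 1)[:, None]

def hierarchical_ids(vectors, doc_indices, k, leaf_size, prefix=()):
    """Recursively cluster documents; each ID is the path of cluster labels."""
    assert leaf_size >= k, "assumption of this sketch: leaves hold at least k docs"
    if len(doc_indices) <= leaf_size:
        # Small enough: number the documents within the leaf.
        return {int(d): prefix + (i,) for i, d in enumerate(doc_indices)}
    labels = kmeans(vectors[doc_indices], k)
    ids = {}
    for j in range(k):
        members = doc_indices[labels == j]
        if len(members) == 0:
            continue
        if len(members) == len(doc_indices):
            # Degenerate split (e.g. identical vectors): stop recursing here.
            for i, d in enumerate(members):
                ids[int(d)] = prefix + (j, i)
            continue
        ids.update(hierarchical_ids(vectors, members, k, leaf_size, prefix + (j,)))
    return ids

# Stand-in data: 50 documents, 4 queries each, 8-dimensional random "embeddings".
rng = np.random.default_rng(1)
query_embs = rng.normal(size=(200, 8))
query_doc = np.repeat(np.arange(50), 4)      # query i belongs to document query_doc[i]
mu = mean_query_vectors(query_embs, query_doc, n_docs=50)
doc_ids = hierarchical_ids(mu, np.arange(50), k=4, leaf_size=4)
```

Each entry of `doc_ids` maps a document index to a tuple of integers that an autoregressive model could be trained to generate token by token. Clustering the mean query vectors rather than document embeddings is the one design choice taken from the summary above; everything else is illustrative.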

Abstract

We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a bottleneck in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
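For orientation, the two textbook objectives that the abstract alludes to can be written in its notation ($X$ documents, $T$ indexes, $Q$ queries). These are the standard forms of the rate-distortion function and the information bottleneck Lagrangian; the distortion measure $d$, the budget $D$, and the trade-off weight $\beta$ are generic symbols assumed here, and the paper's own equations may be stated differently.

```latex
% Rate-distortion function: the fewest bits about X that T must retain
% while keeping the expected distortion within a budget D.
R(D) = \min_{p(t \mid x)\,:\;\mathbb{E}[d(x,t)] \le D} I(X;T)

% Information bottleneck Lagrangian: compress X into T (small I(X;T))
% while keeping T informative about the queries Q (large I(T;Q)).
\mathcal{L}\bigl[p(t \mid x)\bigr] = I(X;T) - \beta\, I(T;Q), \qquad \beta > 0
```

Read this way, a distortion-only index optimizes the first objective over documents alone, whereas the second objective makes the query distribution part of the trade-off.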

Paper Structure (35 sections, 3 theorems, 19 equations, 7 figures, 4 tables)

This paper contains 35 sections, 3 theorems, 19 equations, 7 figures, 4 tables.

Key Result

Proposition 1.1

Figures (7)

  • Figure 1: (a) Generative document retrieval (GDR) framework. (b) Our contribution using bottleneck-minimal indexing: (b-1) distortion-optimal indexing for documents $\mathcal{X}$; (b-2) optimal indexing for both documents and queries $\mathcal{Q}$.
  • Figure 2: Bottleneck curves
  • Figure 3: Experimental information bottleneck curves corresponding to Figure 2. (a) Empirical information curves for HKmI, measured with T5 models of different sizes. (b) Empirical information curves for different indexing methods, estimated on the NQ320K dataset with the T5-base model.
  • Figure 4: Rec@1 scores on the test set of NQ320K for document IDs generated by hierarchical $k$-means (blue) or random (red) clustering, for different sizes of the finetuned language model T5.
  • Figure 5: Performance enhancement by finetuning the docT5query model for generating improved GenQ documents.
  • ...and 2 more figures

Theorems & Definitions (7)

  • Definition 3.1
  • Proposition 1.1
  • proof
  • Proposition 1.2
  • proof
  • Proposition 1.3
  • proof