Bottleneck-Minimal Indexing for Generative Document Retrieval
Xin Du, Lixin Xiu, Kumiko Tanaka-Ishii
TL;DR
This work reframes generative document retrieval (GDR) through information theory, treating the document index as a bottleneck that must carry information about both documents and queries. Applying rate-distortion theory and the information bottleneck, the authors derive Bottleneck-Minimal Indexing (BMI), which optimizes the index with respect to the query distribution rather than the document space alone. BMI is implemented via hierarchical k-means clustering on the mean query vectors $\mu_{Q|x}$, computed from GenQ/RealQ/DocSeg. The authors show, both theoretically and empirically, that BMI reduces information distortion and improves Recall@1 on NQ320K and MARCO Lite, often outperforming distortion-focused baselines and remaining competitive with state-of-the-art methods. The approach provides a simple, static indexing scheme with strong gains, especially for smaller models, and demonstrates the practical value of incorporating query statistics into indexing for GDR.
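The indexing idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D vectors standing in for real query embeddings, the plain Lloyd's k-means, and the parameters `k` and `leaf_size` are all assumptions made for illustration. Each document is represented by the mean of its query vectors, and the documents are then clustered recursively so that the path of cluster labels becomes the document ID.

```python
import random
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    k = min(k, len(points))
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


def hierarchical_index(doc_vecs, k=2, leaf_size=2, prefix=(), seed=0):
    """Recursively cluster documents; returns {doc_id: tuple of digits}."""
    ids = list(doc_vecs)
    if len(ids) <= leaf_size:
        # Small enough: number the documents within this leaf.
        return {d: prefix + (i,) for i, d in enumerate(ids)}
    labels = kmeans([doc_vecs[d] for d in ids], k, seed=seed)
    groups = defaultdict(dict)
    for d, lab in zip(ids, labels):
        groups[lab][d] = doc_vecs[d]
    if len(groups) == 1:
        # Degenerate split (e.g. identical vectors): stop recursing.
        return {d: prefix + (i,) for i, d in enumerate(ids)}
    index = {}
    for lab, sub in sorted(groups.items()):
        index.update(
            hierarchical_index(sub, k, leaf_size, prefix + (lab,), seed))
    return index


# Toy data (hypothetical): each document has two 2-D "query embeddings".
queries_by_doc = {
    "doc_a": [[0.0, 0.1], [0.1, 0.0]],
    "doc_b": [[0.2, 0.1], [0.1, 0.2]],
    "doc_c": [[5.0, 5.1], [5.1, 5.0]],
    "doc_d": [[5.2, 5.1], [5.1, 5.2]],
}

# One mean query vector per document, playing the role of mu_{Q|x}.
mu = {doc: mean_vector(vs) for doc, vs in queries_by_doc.items()}

# Document IDs are the paths through the cluster hierarchy.
index = hierarchical_index(mu, k=2, leaf_size=2)
print(index)
```

In this toy setup, documents whose queries lie close together share the first digit of their ID, which is the property the clustering is meant to provide. The point of the sketch is only that the clustering runs on query-derived vectors rather than on document embeddings.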
Abstract
We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a "bottleneck" in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
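For orientation, the trade-off described in the abstract can be written in the standard information bottleneck form of Tishby et al. This is a sketch under assumptions: the multiplier $\beta$ and the encoder notation $p(t \mid x)$ do not appear in the text above, and the paper's own formulation may differ in detail.

$$\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Q), \qquad \beta > 0$$

Here $I(X;T)$ plays the role of the rate, that is, how many bits the index $T$ retains about the documents $X$, while $I(T;Q)$ measures how informative the index is about the queries $Q$. In this reading, an index is good when it stays compact yet keeps the information that queries actually depend on.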
