A Neural Corpus Indexer for Document Retrieval

Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Allen Sun, Weiwei Deng, Qi Zhang, Mao Yang

TL;DR

The paper tackles recall limitations in traditional document retrieval by proposing the Neural Corpus Indexer (NCI), an end-to-end seq2seq model that directly generates semantic document identifiers for a query. It introduces a Prefix-Aware Weight-Adaptive (PAWA) decoder, hierarchical semantic identifiers via hierarchical $k$-means, and training techniques including query generation and consistency-based regularization, enabling effective end-to-end retrieval with constrained beam search. Empirical results on NQ320k and TriviaQA show substantial improvements over state-of-the-art baselines, with ensemble variants delivering the strongest gains. The work suggests a viable direction toward unified neural retrieval systems that combine indexing and retrieval in a single differentiable framework.

Abstract

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, in which the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying the training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose the Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a given query. To optimize the recall performance of NCI, we design a prefix-aware weight-adaptive decoder architecture and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on two commonly used academic benchmarks, achieving relative improvements of +21.4% in Recall@1 on the NQ320k dataset and +16.8% in R-Precision on the TriviaQA dataset over the best baseline method.
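To make "generates relevant document identifiers directly" concrete, the following is a minimal sketch of constrained beam search over a prefix trie of valid identifiers, so that only identifiers present in the corpus can be decoded. It is an illustration under stated assumptions, not the authors' implementation: `build_trie`, `constrained_beam_search`, `toy_score`, and the `EOS` marker are hypothetical names, and `toy_score` is a fixed stand-in for the decoder's next-token log-probability.

```python
import math

EOS = "<eos>"  # hypothetical end-of-identifier marker

def build_trie(docids):
    """Nested-dict prefix trie over identifier token sequences."""
    trie = {}
    for docid in docids:
        node = trie
        for tok in docid:
            node = node.setdefault(tok, {})
        node[EOS] = {}  # marks a complete, valid identifier
    return trie

def constrained_beam_search(score_fn, trie, beam_size=2):
    """Beam search that only expands tokens allowed by the trie.

    score_fn(prefix, token) returns a log-probability; here it stands in
    for a trained seq2seq decoder conditioned on the query.
    """
    beams = [(0.0, (), trie)]  # (log-prob, prefix, trie node)
    finished = []
    while beams:
        candidates = []
        for logp, prefix, node in beams:
            for tok, child in node.items():  # valid continuations only
                new_logp = logp + score_fn(prefix, tok)
                if tok == EOS:
                    finished.append((new_logp, prefix))
                else:
                    candidates.append((new_logp, prefix + (tok,), child))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda item: item[0], reverse=True)
    return [(prefix, logp) for logp, prefix in finished[:beam_size]]

def toy_score(prefix, tok):
    # Assumption: a fixed toy distribution that ignores the prefix.
    probs = {0: 0.3, 1: 0.5, 2: 0.1, EOS: 0.1}
    return math.log(probs[tok])

docids = [(1, 0), (1, 2), (2, 2)]
results = constrained_beam_search(toy_score, build_trie(docids), beam_size=2)
print([docid for docid, _ in results])  # → [(1, 0), (1, 2)]
```

Although the toy scorer prefers token `1` at every step, the sequence `(1, 1)` is never returned because it is not a path in the trie. This simplified version keeps finished hypotheses in a separate list rather than counting them against the beam width.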

Paper Structure

This paper contains 28 sections, 6 equations, 4 figures, 6 tables, and 2 algorithms.

Figures (4)

  • Figure 1: Overview of Neural Corpus Indexer (NCI). (a) Preprocessing. Each document is represented by a semantic identifier via hierarchical $k$-means. (b) Query Generation. Queries are generated for each document based on the content. (c) The training pipeline of NCI. The model is trained over augmented <query, docid> pairs through a standard transformer encoder and the proposed Prefix-Aware Weight-Adaptive (PAWA) Decoder.
  • Figure 2: Overview of the Prefix-Aware Weight-Adaptive (PAWA) Decoder.
  • Figure 3: Learning curves of NCI with different model capacities. Left: NQ320k; Right: TriviaQA.
  • Figure 4: Analyses of retrieved documents with semantic identifiers. Left: The probabilities of retrieved documents for Query Group A; Middle: The probabilities for Query Group B; Right: The t-SNE visualization of BERT-based document embeddings.
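
The preprocessing step in Figure 1(a) can be sketched as follows: documents are clustered recursively with $k$-means, and each document's identifier is the path of cluster indices followed by its position inside the final cluster. This is a minimal pure-Python sketch, assuming documents are already embedded as numeric vectors. The function names, the leaf-size parameter `c`, the toy 2-D embeddings, and the fallback for clusters that cannot be split are illustrative choices, not details taken from the paper.

```python
import random

def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    k = min(k, len(points))
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: _dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def _number_leaf(doc_idx, prefix, ids):
    # Leaf cluster: append each document's position within the cluster.
    for pos, d in enumerate(doc_idx):
        ids[d] = prefix + (pos,)

def _assign(doc_idx, emb, k, c, prefix, ids):
    if len(doc_idx) <= c:
        _number_leaf(doc_idx, prefix, ids)
        return
    labels = kmeans([emb[d] for d in doc_idx], k)
    groups = [[d for d, lab in zip(doc_idx, labels) if lab == j]
              for j in range(k)]
    # Assumed fallback: if clustering cannot split the set (for example,
    # all embeddings are identical), number the documents as one leaf.
    if max(len(g) for g in groups) == len(doc_idx):
        _number_leaf(doc_idx, prefix, ids)
        return
    for j, group in enumerate(groups):
        if group:
            _assign(group, emb, k, c, prefix + (j,), ids)

def semantic_ids(embeddings, k=2, c=2):
    """Map each document index to a tuple-valued hierarchical identifier."""
    ids = {}
    _assign(list(range(len(embeddings))), embeddings, k, c, (), ids)
    return ids

# Toy 2-D "embeddings": two well-separated groups of three documents.
docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
ids = semantic_ids(docs, k=2, c=3)
for doc, ident in sorted(ids.items()):
    print(doc, ident)
```

Documents that are close in embedding space share an identifier prefix, which is the property that lets a decoder narrow down the candidate set token by token.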