IncDSI: Incrementally Updatable Document Retrieval

Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger

TL;DR

IncDSI addresses the challenge of updating differentiable search index models in real time by optimizing only the vector for a newly added document and leaving all other parameters fixed. It formulates this as a constrained optimization over the document-vector space, balancing retrieval of the new document against preservation of existing performance via the loss $\mathcal{L}(\boldsymbol{v}_{n+1}) = \lambda_1 \ell_1(\boldsymbol{v}_{n+1}) + (1-\lambda_1) \ell_2(\boldsymbol{v}_{n+1}) + \lambda_2 \|\boldsymbol{v}_{n+1}\|_2^2$, where $\ell_1$ encodes the constraint that queries for the new document retrieve it, $\ell_2$ encodes the constraint that queries for existing documents still retrieve their original targets, and the last term regularizes the norm of the new vector. The approach uses a BERT-based encoder with a classification layer, indexes documents via generated queries, and updates the index with L-BFGS, achieving updates in about $20$-$50$ ms per document while maintaining strong retrieval for both old and new content. Experiments on Natural Questions 320K and MS MARCO demonstrate that IncDSI can index thousands of documents much faster than retraining, with comparable or superior performance for new documents and modest degradation for old ones that could be mitigated with further retraining. The work shows a practical path toward streaming, updatable neural IR systems, complementing full retraining for large-scale changes and enabling real-time access to newly added information.
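The loss above can be illustrated with a small NumPy sketch. Several pieces here are assumptions rather than details taken from the summary: $\ell_1$ and $\ell_2$ are modeled as mean hinge penalties with a margin, the function names (`loss_and_grad`, `fit_new_vector`) and the toy data are hypothetical, and plain (sub)gradient descent stands in for the L-BFGS optimizer that the paper uses. Only the new vector is updated; the existing document vectors stay fixed.

```python
import numpy as np

def loss_and_grad(v, new_q, old_q, old_V, old_ids, lam1=0.5, lam2=0.01, margin=1.0):
    """Loss lam1*l1 + (1-lam1)*l2 + lam2*||v||^2 and its (sub)gradient w.r.t. v.

    ASSUMPTION: l1 and l2 are mean hinge penalties; the summary only says
    they encode the constraints on new and old documents.
    """
    # l1: every query of the new document should score the new vector
    # above the best-scoring existing document by at least `margin`.
    best_old = (new_q @ old_V.T).max(axis=1)
    viol1 = margin + best_old - new_q @ v
    l1 = np.maximum(viol1, 0.0).mean()
    g1 = -new_q[viol1 > 0].sum(axis=0) / len(new_q)

    # l2: every cached query of an existing document should keep scoring
    # its own document above the new vector by at least `margin`.
    own = np.einsum("ij,ij->i", old_q, old_V[old_ids])
    viol2 = margin + old_q @ v - own
    l2 = np.maximum(viol2, 0.0).mean()
    g2 = old_q[viol2 > 0].sum(axis=0) / len(old_q)

    loss = lam1 * l1 + (1 - lam1) * l2 + lam2 * (v @ v)
    grad = lam1 * g1 + (1 - lam1) * g2 + 2 * lam2 * v
    return loss, grad

def fit_new_vector(new_q, old_q, old_V, old_ids, lr=0.1, steps=500):
    """Optimize only the new document vector; old_V is never modified.

    ASSUMPTION: fixed-step gradient descent from a zero vector, used here
    as a simple stand-in for L-BFGS.
    """
    v = np.zeros(old_V.shape[1])
    for _ in range(steps):
        _, grad = loss_and_grad(v, new_q, old_q, old_V, old_ids)
        v = v - lr * grad
    return v

# Toy data (hypothetical): two indexed documents, one cached query each.
old_V = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
old_q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
old_ids = np.array([0, 1])
# Two query embeddings for the document being added.
new_q = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]])

v_new = fit_new_vector(new_q, old_q, old_V, old_ids)
V_ext = np.vstack([old_V, v_new])  # classification layer with one extra row

print("new queries retrieve:", (new_q @ V_ext.T).argmax(axis=1))
print("old queries retrieve:", (old_q @ V_ext.T).argmax(axis=1))
```

Appending `v_new` as one extra row of the classification matrix is what makes the update cheap in this sketch: no existing row and no encoder weight is touched, so the cost depends on the number of cached queries rather than on retraining.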

Abstract

Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks, but they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50 ms per document) without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with retraining the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.

Paper Structure (31 sections, 6 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of our proposed setting. IncDSI can index incoming documents immediately and begin serving them to users.
  • Figure 2: An illustration of the process of adding a new document (shown in purple) with its associated queries. The queries are embedded using the encoder trained on initial documents. A single document vector is optimized to be closer to the query embeddings (all other document vectors are fixed).
  • Figure 3: Time taken to add documents for different methods. Numbers on the bars are hit@1 for new documents. Lighter shades in stacked bars indicate later checkpoints (epochs 1, 5, and 10). DPR, which only requires embedding queries and computing inner products, is not shown because it uses a model trained on just the original data and performs worse than the models shown here.
  • Figure 4: We present the retrieval performance for the original documents and new documents as increasing numbers of documents are indexed. The IncDSI performance represents the average over 10 random document orderings.
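
Figures 3 and 4 report hit@1. As a minimal sketch, assuming hit@k means the fraction of queries whose target document is among the k highest-scoring documents under inner-product scoring, the metric could be computed as follows. The function name `hit_at_k` and the toy arrays are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def hit_at_k(query_emb, doc_vectors, gold_ids, k=1):
    """Fraction of queries whose gold document is in the top-k by inner product."""
    scores = query_emb @ doc_vectors.T            # shape: (num_queries, num_docs)
    topk = np.argsort(-scores, axis=1)[:, :k]     # k best document ids per query
    hits = (topk == gold_ids[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example (hypothetical): three document vectors and three queries.
doc_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
query_emb = np.array([[1.0, 0.2], [0.4, 1.0], [-1.0, -0.5]])
gold_ids = np.array([0, 0, 2])

print("hit@1:", hit_at_k(query_emb, doc_vectors, gold_ids, k=1))
print("hit@2:", hit_at_k(query_emb, doc_vectors, gold_ids, k=2))
```

In this toy example the second query ranks its gold document second, so it counts as a miss at k=1 and a hit at k=2.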