IncDSI: Incrementally Updatable Document Retrieval
Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
TL;DR
IncDSI addresses the challenge of updating differentiable search index models in real time by learning only the vector for a newly added document. It formulates this as a constrained optimization over the document-vector space, balancing retrieval of the new document with preservation of existing performance via a loss $\mathcal{L}(\boldsymbol{v}_{n+1}) = \lambda_1 \ell_1(\boldsymbol{v}_{n+1}) + (1-\lambda_1) \ell_2(\boldsymbol{v}_{n+1}) + \lambda_2 \|\boldsymbol{v}_{n+1}\|_2^2$, where $\ell_1$ and $\ell_2$ encode the dual constraints on new and old documents. The approach uses a BERT-based encoder with a classification layer, indexes documents via generated queries, and updates the index with L-BFGS, achieving updates in about 20-50 ms per document while maintaining strong retrieval for both old and new content. Experiments on Natural Questions 320K and MS MARCO demonstrate that IncDSI can index thousands of documents much faster than retraining, with comparable or superior performance for new documents and modest degradation for old ones that could be mitigated with further retraining. The work shows a practical path toward streaming, updatable neural IR systems, complementing full retraining for large-scale changes and enabling real-time access to newly added information.
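As a rough illustration of the objective above, the following Python sketch fits a single new document vector with SciPy's L-BFGS-B routine (used here without bounds). The squared-hinge forms chosen for $\ell_1$ and $\ell_2$, the margin, the hyperparameter values, the initialisation at the new query embedding, and all function and variable names are assumptions made for this sketch; the summary only states that $\ell_1$ and $\ell_2$ encode the constraints on new and old documents. Synthetic random vectors stand in for BERT query embeddings.

```python
import numpy as np
from scipy.optimize import minimize


def incremental_loss(v, q_new, W_old, Z_old, y_old, lam1, lam2, margin):
    """Return (loss, gradient) for a candidate new-document vector v.

    Assumed penalty forms (illustrative, not taken from the paper):
      l1: squared hinge pushing v.q_new above the best old score for q_new
      l2: mean squared hinge keeping v.z_i below each old query's own score
    """
    # l1: the new query should score higher on v than on any old vector.
    best_old = np.max(W_old @ q_new)
    s1 = max(0.0, best_old - v @ q_new + margin)
    l1, g1 = s1 ** 2, -2.0 * s1 * q_new

    # l2: each old query should still prefer its own document over v.
    own = np.einsum("ij,ij->i", W_old[y_old], Z_old)
    s2 = np.maximum(0.0, Z_old @ v - own + margin)
    l2 = np.mean(s2 ** 2)
    g2 = 2.0 * (s2 @ Z_old) / len(Z_old)

    loss = lam1 * l1 + (1.0 - lam1) * l2 + lam2 * (v @ v)
    grad = lam1 * g1 + (1.0 - lam1) * g2 + 2.0 * lam2 * v
    return loss, grad


def add_document(q_new, W_old, Z_old, y_old, lam1=0.5, lam2=1e-3, margin=1.0):
    """Fit only the new vector; the old document vectors are left untouched."""
    v0 = q_new.copy()  # assumed initialisation: start at the query embedding
    result = minimize(
        incremental_loss,
        v0,
        args=(q_new, W_old, Z_old, y_old, lam1, lam2, margin),
        jac=True,  # incremental_loss returns (value, gradient)
        method="L-BFGS-B",
    )
    return np.vstack([W_old, result.x]), result


# Synthetic stand-ins for document vectors and query embeddings.
rng = np.random.default_rng(0)
dim, n_old = 128, 20
W_old = rng.normal(size=(n_old, dim))                  # existing document vectors
Z_old = W_old + 0.1 * rng.normal(size=(n_old, dim))    # one query per old document
y_old = np.arange(n_old)                               # old query -> document id
q_new = rng.normal(size=dim)                           # query for the new document

W_new, result = add_document(q_new, W_old, Z_old, y_old)
new_doc_rank1 = int(np.argmax(W_new @ q_new)) == n_old
old_docs_kept = np.array_equal(np.argmax(W_new @ Z_old.T, axis=0), y_old)
print("new document retrieved:", new_doc_rank1, "| old retrieval intact:", old_docs_kept)
```

Only one row is appended to the classification matrix, which is why an update of this kind can be far cheaper than retraining the encoder. A real system would use several generated queries per new document rather than the single vector `q_new` used here.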
Abstract
Differentiable Search Index (DSI) is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performance for document retrieval across many benchmarks. However, they have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50 ms per document) without retraining the model on the entire dataset (or even parts thereof). Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with retraining the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
