DSI++: Updating Transformer Memory with New Documents

Sanket Vaibhav Mehta; Jai Gupta; Yi Tay; Mostafa Dehghani; Vinh Q. Tran; Jinfeng Rao; Marc Najork; Emma Strubell; Donald Metzler

TL;DR

DSI++ tackles the challenge of updating Differentiable Search Indices (DSIs) as corpora grow, focusing on the catastrophic forgetting that occurs during continual indexing. The authors propose two mitigation strategies: Sharpness-Aware Minimization (SAM), which promotes flatter minima, and a generative memory that samples pseudo-queries for documents. They also introduce a continual learning setup and benchmarks derived from Natural Questions and MS MARCO. Empirical results show substantial reductions in forgetting and meaningful forward transfer to new documents, with roughly 6x fewer model updates than re-training from scratch and consistent improvements across datasets and model scales. The work advances practical deployment of DSIs in dynamic environments and contributes to the broader understanding of memory-preserving strategies in continual learning for language models.

Abstract

Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents ($+12\%$). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
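To make the "optimize for flatter loss basins" idea concrete, here is a minimal sketch of a single Sharpness-Aware Minimization step on a toy quadratic loss. It is an illustration under stated assumptions, not the authors' training code: the base optimizer is plain gradient descent (the paper's figures compare against Adafactor), and the names `sam_step`, `grad_fn`, `lr`, and `rho`, as well as the toy curvature vector, are hypothetical choices made for this example.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update (sketch): perturb the weights toward higher loss,
    then descend using the gradient measured at the perturbed point."""
    g = grad_fn(w)
    # Ascent direction scaled to radius rho; the small constant avoids division by zero.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_perturbed = grad_fn(w + eps)
    # Assumption: plain gradient descent as the base optimizer.
    return w - lr * g_perturbed

# Toy problem (assumed for illustration): loss(w) = 0.5 * sum(curvature * w**2).
curvature = np.array([1.0, 10.0])

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

w = np.array([1.0, 1.0])
initial_loss = loss(w)
for _ in range(200):
    w = sam_step(w, grad)
final_loss = loss(w)
```

With a fixed `rho`, the iterate settles into a small neighborhood of the minimum rather than converging exactly, because the perturbation never shrinks to zero.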

Paper Structure (38 sections, 1 equation, 8 figures, 3 tables)

Introduction
DSI++: Continual learning challenge for DSI
Problem setup
Goal:
Benchmarks for DSI++
Evaluation Metrics
Case study: Forgetting and Forward Transfer
Forgetting.
Forward transfer.
Docid representations.
Model scale.
Implicit Forgetting: SAM
Forgetting events.
Flatness and forgetting.
Sharpness-Aware Minimization.
...and 23 more sections

Figures (8)

Figure 1: Indexing accuracy of $D_0, D_1,$ and $D_2$ document corpora visualized as we continuously index new documents (averaged over $3$ runs). We observe that continual indexing of new documents leads to severe forgetting of the previously memorized documents.
Figure 2: Systematic study of forgetting and forward transfer when incrementally indexing new corpora of documents across different model sizes (T5-Base, T5-Large, T5-XL) and docid representations. We use atomic docids by default and denote naively/semantically structured docids by (N)/(S). $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. All results are averaged over $3$ runs. We observe that the average performance $A_n$ and learning performance $LA_n$ improve with increasing model scale. However, forgetting $F_n$ is severe across all model scales. Next, we observe that naively structured docids, T5-Base(N), underperform unstructured atomic docids, T5-Base, across all metrics: indexing accuracy and Hits@1 (see Figure \ref{fig:case_study_forgetting_hits10} in the Appendix for Hits@10 results). Imbuing the docid space with a semantic (S) structure alleviates forgetting compared to an arbitrary/naive (N) structure.
Figure 3: Investigating the effectiveness of SAM for alleviating implicit forgetting in the T5-Base model by visualizing the cumulative histogram of forgetting events. A forgetting event (Toneva et al., 2018) occurs when an individual document goes from being classified correctly to incorrectly over the course of memorization. SAM increases the percentage of examples experiencing zero forgetting events by an absolute $12\%$ over Adafactor.
Figure 4: Investigating the effectiveness of generative memory in mitigating forgetting when continuously indexing a new corpus $D_n$ (T5-Base model and atomic docid representation) for the NQ dataset. $\uparrow$ indicates higher is better, $\downarrow$ indicates lower is better. We observe that continual indexing of old and new documents, cl($U_n$), helps to alleviate forgetting of older documents when evaluated on retrieval tasks. However, average Hits@10 ($A_n$) still drops by $23$ points after sequential updates ($D_0 \rightarrow D_1 \cdots \rightarrow D_5$). Generative memory enables sparse replay of pseudo-queries for old documents and continual semi-supervised learning with new documents. We observe that augmenting continual indexing with generative memory not only reduces forgetting ($F_n$) but also improves average Hits@10 ($A_n$) by $+21.1\%$ over the considered baselines (see Figure \ref{fig:generative_memory_hits1} for Hits@1 results and Figure \ref{fig:generative_memory_msmarco} for MS MARCO results in the Appendix).
Figure 5: Investigating the effectiveness of SAM for alleviating implicit forgetting in the T5-Base model by visualizing indexing accuracy during memorization. We observe serious fluctuations in the indexing accuracy in the case of the Adafactor optimizer, thereby suggesting unstable memorization. SAM leads to relatively stable memorization of documents.
...and 3 more figures
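
The forgetting-event definition in the Figure 3 caption can be sketched in a few lines. This is a hedged illustration, not the authors' evaluation code: it assumes each document has a hypothetical per-checkpoint record of whether it was retrieved correctly, and the function names and toy histories are invented for this example.

```python
def count_forgetting_events(correct_history):
    """Count transitions from correct (True) to incorrect (False)
    between consecutive evaluations of one document."""
    return sum(
        1
        for prev, cur in zip(correct_history, correct_history[1:])
        if prev and not cur
    )

def fraction_never_forgotten(histories):
    """Fraction of documents that experience zero forgetting events."""
    counts = [count_forgetting_events(h) for h in histories]
    return sum(c == 0 for c in counts) / len(counts)

# Toy histories (assumed data): one row per document, one entry per evaluation.
histories = [
    [False, True, True, True],   # learned once and retained
    [False, True, False, True],  # forgotten once, then relearned
    [True, False, True, False],  # forgotten twice
]
event_counts = [count_forgetting_events(h) for h in histories]
stable_fraction = fraction_never_forgotten(histories)
```

Note that this simple count also assigns zero events to a document that is never classified correctly, so a fuller analysis may want to track such documents separately.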
