MURR: Model Updating with Regularized Replay for Searching a Document Stream
Eugene Yang, Nicola Tonellotto, Dawn Lawrie, Sean MacAvaney, James Mayfield, Douglas W. Oard, Scott Miller
TL;DR
The paper addresses how neural dense retrievers can adapt to nonstationary document and query streams without re-encoding the full archive of past documents. It proposes MURR, a session-based updating scheme with regularized replay that fine-tunes the model on new content while keeping it compatible with previously encoded document representations. Two variants are considered: MURR-CF, which continues from the previous session's model, and MURR-LM, which starts from a language model. In simulated streaming scenarios, MURR-CF yields more effective and more stable retrieval across sessions than the baselines, and ablations indicate that both replay and representation regularization are necessary. The work presents a practical, scalable approach to continuous neural retrieval in dynamic information environments, with potential impact on live search systems and evolving topic discovery.
Abstract
The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously encoded documents. Re-encoding and re-indexing all previously processed documents can be costly. In this work, we explore updating a neural dual-encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.
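
The idea of regularized replay described above can be illustrated with a toy loss. The abstract does not state the training objective, so the following is only a minimal sketch under stated assumptions, not the authors' formulation: it assumes a dot-product dual encoder and a three-term loss, namely a contrastive term on the new session, a contrastive term that replays an older query against the document vectors already stored in the index, and a penalty on how far the updated encoder moves those documents from their stored vectors. All names (`contrastive_loss`, `representation_drift`, `regularized_replay_loss`, `reg_weight`) and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product relevance score between two vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query_vec, doc_vecs, positive_idx):
    """Softmax cross-entropy over dot-product scores with one positive document."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[positive_idx]


def representation_drift(reencoded_vecs, stored_vecs):
    """Mean squared distance between re-encoded and already-indexed vectors."""
    total = 0.0
    for new, old in zip(reencoded_vecs, stored_vecs):
        total += sum((a - b) ** 2 for a, b in zip(new, old))
    return total / len(stored_vecs)


def regularized_replay_loss(new_example, replay_example, reg_weight=1.0):
    """Assumed combination: new-session term + replay term + weighted drift penalty."""
    new_term = contrastive_loss(
        new_example["query"], new_example["docs"], new_example["positive"]
    )
    # The replayed query is scored against the frozen vectors in the index,
    # because those are what the deployed system would actually search.
    replay_term = contrastive_loss(
        replay_example["query"],
        replay_example["stored_docs"],
        replay_example["positive"],
    )
    drift_term = representation_drift(
        replay_example["reencoded_docs"], replay_example["stored_docs"]
    )
    return new_term + replay_term + reg_weight * drift_term


# Toy two-dimensional vectors standing in for encoder outputs (hypothetical).
new_example = {"query": [1.0, 0.0], "docs": [[1.0, 0.0], [0.0, 1.0]], "positive": 0}
replay_example = {
    "query": [0.0, 1.0],
    "stored_docs": [[0.0, 1.0], [1.0, 0.0]],     # vectors already in the index
    "reencoded_docs": [[0.0, 0.5], [1.0, 0.0]],  # same documents, updated encoder
    "positive": 0,
}
loss = regularized_replay_loss(new_example, replay_example, reg_weight=0.5)
```

In this sketch the drift penalty is zero when the updated encoder reproduces the stored vectors exactly and grows as the representations diverge, which is one simple way to discourage incompatibility with an index that is never rebuilt.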
