MURR: Model Updating with Regularized Replay for Searching a Document Stream

Eugene Yang, Nicola Tonellotto, Dawn Lawrie, Sean MacAvaney, James Mayfield, Douglas W. Oard, Scott Miller

TL;DR

The paper addresses nonstationary document and query streams, asking how neural dense retrievers can adapt without re-encoding large archives. It proposes MURR, a session-based updating scheme with regularized replay that keeps the model compatible with past document representations while fine-tuning on new content. Two variants are studied: MURR-CF (continue from the previous session's model) and MURR-LM (start from a language model). In simulated streaming scenarios, MURR-CF yields more effective and more stable retrieval across sessions than the baselines, and ablations confirm that both replay and representation regularization are necessary. The work points to a practical, scalable approach for continuous neural retrieval in dynamic information environments, with potential impact on live search systems and evolving topic discovery.

Abstract

The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously-encoded documents. Re-encoding and re-indexing all previously-processed documents can be costly. In this work, we explore updating a neural dual encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.
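
The idea of regularized replay described above can be illustrated with a small sketch. This is not the paper's formulation: the dot-product scoring, the pairwise softmax loss, the squared-distance regularizer, and every name below (`pairwise_loss`, `regularized_replay_loss`, the tuple layouts) are assumptions made for illustration. The only value taken from the source is the regularization strength `alpha=0.01`, which appears in the caption of Figure 4. The sketch combines a ranking loss over new and replayed triples with a penalty on how far the current encoder has moved replayed documents away from the vectors already stored in the index.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_loss(q, d_pos, d_neg):
    """Assumed ranking loss: log(1 + exp(s_neg - s_pos)), computed stably."""
    margin = dot(q, d_neg) - dot(q, d_pos)
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def regularized_replay_loss(new_triples, replay_triples, alpha=0.01):
    """Hypothetical combined objective for one update step.

    new_triples:    list of (q, d_pos, d_neg) vectors from the current encoder.
    replay_triples: list of (q, d_pos, d_neg, d_pos_indexed), where
                    d_pos_indexed is the vector stored in the index when the
                    document was first encoded in an earlier session.
    alpha:          weight of the representation regularizer.
    """
    task_terms = [pairwise_loss(q, p, n) for q, p, n in new_triples]
    task_terms += [pairwise_loss(q, p, n) for q, p, n, _ in replay_triples]
    if not task_terms:
        raise ValueError("at least one training triple is required")
    task = sum(task_terms) / len(task_terms)

    # Penalize drift of replayed documents away from their indexed vectors,
    # so that old index entries remain searchable by the updated model.
    if replay_triples:
        drift = sum(
            squared_distance(p, p_old) for _, p, _, p_old in replay_triples
        ) / len(replay_triples)
    else:
        drift = 0.0
    return task + alpha * drift
```

With an empty `replay_triples` list the penalty term vanishes and the function reduces to plain fine-tuning on new triples, which loosely corresponds to the "CF w/o Replay" baseline named in the figure captions.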

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Sampling distribution of each domain in each session and simulated stream. The color of the bars represents the session, along with the session ID marked at the bottom of each bar.
  • Figure 2: Success@5 of each model updating strategy, where the x-axis is the running session ID. Values in parentheses in the legend are the macro-average of Success@5 across all query sets and streams. After multiple-test correction, differences between strategies are all statistically significant under the paired t-test described in Section \ref{sec:exp:evaluation}, except for the pair of CF w/o Replay and MURR-LM.
  • Figure 3: Query set breakdown of five model updating strategies in Scenario D2. Each subgraph shows the effectiveness over sessions for the set of queries introduced in the session indicated in the title. Values in parentheses are the macro-average Success@5 over query sets in D2 only. The x-axis is the running session ID.
  • Figure 4: Ablation study on the number of replay triples and the regularization strength under MURR-CF in Scenario D2. The x-axis is the session ID. The dashed gray line is the setup (200 triples, $\alpha=0.01$) used in the main experiments. Using 0 replay triples is essentially CF w/o Replay in the main results.
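
The captions above report Success@5 and its macro-average across query sets. As a minimal sketch of how these quantities are commonly computed (the function names and the dictionary-based data layout are hypothetical and not taken from the paper), Success@k scores a query 1 if any relevant document appears among its top k results and 0 otherwise, and the macro-average gives each query set equal weight regardless of its size.

```python
def success_at_k(ranking, relevant, k=5):
    """Return 1.0 if any of the top-k ranked doc IDs is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranking[:k]) else 0.0


def mean_success_at_k(rankings, qrels, k=5):
    """Average Success@k over the queries in one query set.

    rankings: dict mapping query ID to a ranked list of doc IDs.
    qrels:    dict mapping query ID to the set of relevant doc IDs.
    """
    scores = [
        success_at_k(ranking, qrels.get(query_id, set()), k)
        for query_id, ranking in rankings.items()
    ]
    return sum(scores) / len(scores)


def macro_average(per_set_scores):
    """Unweighted mean of per-query-set scores."""
    return sum(per_set_scores) / len(per_set_scores)


# Toy example with two query sets of different sizes.
set_a = mean_success_at_k(
    {"q1": ["d1", "d2", "d3", "d4", "d5", "d6"], "q2": ["d7", "d8"]},
    {"q1": {"d6"}, "q2": {"d8"}},
)
set_b = mean_success_at_k({"q3": ["d9"]}, {"q3": {"d9"}})
overall = macro_average([set_a, set_b])
```

In the toy example, `q1` misses because its only relevant document sits at rank 6, so the first set scores 0.5, the second scores 1.0, and the macro-average is 0.75.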