Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluation

Jingrui Hou, Georgina Cosma, Axel Finke

TL;DR

This work formalizes continual neural information retrieval (NIR) as a sequential task framework, introduces Topic-MSMARCO to benchmark continual IR, and presents the CLNIR framework that decouples model choice from learning strategy. By evaluating combinations of embedding-based and pretraining-based IR models with regularization and replay strategies, the study demonstrates that appropriate strategies substantially reduce forgetting and improve performance on past tasks, with pretraining-based models showing the strongest gains and stability under topic shifts. Key findings reveal that topic shift and data augmentation more strongly degrade embedding-based models, while learning strategies can mitigate these effects; however, strategy effectiveness is model-dependent. The proposed framework and dataset provide a practical pathway for building IR systems capable of continual adaptation, while highlighting limitations and avenues for future work, including novel strategies and multimodal extensions.

Abstract

Continual learning refers to the capability of a machine learning model to learn and adapt to new information, without compromising its performance on previously learned tasks. Although several studies have investigated continual learning methods for information retrieval tasks, a well-defined task formulation is still lacking, and it is unclear how typical learning strategies perform in this context. To address this challenge, a systematic task formulation of continual neural information retrieval is presented, along with a multiple-topic dataset that simulates continuous information retrieval. A comprehensive continual neural information retrieval framework consisting of typical retrieval models and continual learning strategies is then proposed. Empirical evaluations illustrate that the proposed framework can successfully prevent catastrophic forgetting in neural information retrieval and enhance performance on previously learned tasks. The results indicate that embedding-based retrieval models experience a decline in their continual learning performance as the topic shift distance and dataset volume of new tasks increase. In contrast, pretraining-based models do not show any such correlation. Adopting suitable learning strategies can mitigate the effects of topic shift and data augmentation.

Paper Structure (29 sections, 17 equations, 8 figures, 5 tables, 1 algorithm)

This paper contains 29 sections, 17 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: An example diagram of continual neural information retrieval with three tasks ($T = 3$). The neural information retrieval model is initially trained on the training set of task 1 and then tested on all three test sets to generate $P_{1,1}$, $P_{1,2}$, and $P_{1,3}$. After training on the set for task 2, $P_{2,1}$, $P_{2,2}$, and $P_{2,3}$ will be generated. Finally, upon completion of task 3, the model will produce $P_{3,1}$, $P_{3,2}$, and $P_{3,3}$.
  • Figure 2: The framework architecture. The models and learning strategies shown are explained in the paper's sections on NIR models and learning strategies, respectively.
  • Figure 3: Architectures of word-embedding-based models.
  • Figure 4: Architectures of pre-trained models.
  • Figure 5: Pairwise task distances in Topic-MSMARCO. A larger value denotes a larger semantic distance.
  • ...and 3 more figures
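
The evaluation protocol described in the Figure 1 caption (train on each task in turn, then test on every task's test set to obtain $P_{i,j}$) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names `performance_matrix`, `toy_train`, and `toy_eval`, the topic names, and the toy "model" that remembers only its two most recent topics are all assumptions made for the example. Indices are 0-based in the code, so `P[i][j]` corresponds to $P_{i+1,j+1}$ in the caption.

```python
from typing import Callable, List, Sequence, Tuple

# A task is a (training set, test set) pair. Plain strings stand in for real
# data here; this is a hypothetical simplification for illustration only.
Task = Tuple[str, str]


def performance_matrix(
    tasks: Sequence[Task],
    model: List[str],
    train_fn: Callable[[List[str], str], List[str]],
    eval_fn: Callable[[List[str], str], float],
) -> List[List[float]]:
    """Train on tasks in order; after each one, score the model on all test sets.

    Row i holds the scores obtained after training on task i, so P[i][j] is
    the performance on task j's test set at that point.
    """
    rows: List[List[float]] = []
    for train_set, _ in tasks:
        model = train_fn(model, train_set)
        rows.append([eval_fn(model, test_set) for _, test_set in tasks])
    return rows


def toy_train(model: List[str], train_set: str) -> List[str]:
    # Hypothetical "model": keeps only the two most recently seen topics,
    # which mimics forgetting of older tasks.
    return (model + [train_set])[-2:]


def toy_eval(model: List[str], test_set: str) -> float:
    # Hypothetical score: 1.0 if the topic is still remembered, else 0.0.
    return 1.0 if test_set in model else 0.0


tasks = [("sports", "sports"), ("health", "health"), ("finance", "finance")]
P = performance_matrix(tasks, [], toy_train, toy_eval)
for row in P:
    print(row)
# [1.0, 0.0, 0.0]
# [1.0, 1.0, 0.0]
# [0.0, 1.0, 1.0]
```

In this toy run, the first column drops from 1.0 to 0.0 in the last row, which is the kind of decline on an earlier task that the abstract refers to as catastrophic forgetting. A real setup would replace `toy_train` and `toy_eval` with an actual retrieval model's training step and a ranking metric.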