Watch Your Step: Optimal Retrieval for Continual Learning at Scale

Truman Hickok, Dhireesha Kudithipudi

TL;DR

Watch Your Step investigates selective retrieval for replay in continual learning at scale, formalizing class- and sample-level primitives and evaluating their combinations on a large, pre-trained open-vocabulary detector (OWL-ViT). The study finds that simple, well-tuned primitives with proper deduplication and consistent replay pressure often outperform complex hybrids, while loss-adaptive replay can degrade performance if applied too aggressively. It also shows that gradient-free replay-buffer construction via loss thresholds and class balancing yields strong gains, and that dataset characteristics (notably size) strongly influence forgetting dynamics. The work provides practical guidelines for scalable continual learning pipelines and highlights the balance between selectivity, diversity, and replay budget in open-world perception tasks.

Abstract

In continual learning, a model learns incrementally over time while minimizing interference between old and new tasks. One of the most widely used approaches in continual learning is replay. Replay methods support interleaved learning by storing past experiences in a replay buffer. Although there are methods for selectively constructing the buffer and reprocessing its contents, the problem of selectively retrieving samples from the buffer has received limited attention. Current solutions have been tested in limited settings and, more importantly, in isolation. Existing work has also not explored the impact of duplicate replays on performance. In this work, we propose a framework for evaluating selective retrieval strategies, categorized by simple, independent class- and sample-selective primitives. We evaluate several combinations of existing strategies for selective retrieval and report their performance. Furthermore, we propose a set of strategies to prevent duplicate replays and explore whether new samples with low loss values can be learned without replay. To match our problem setting to a realistic continual learning pipeline, we restrict our experiments to a large, pre-trained, open-vocabulary object detection model, which is fully fine-tuned on a sequence of 15 datasets.
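The retrieval ideas named in the abstract (a class-selective step, a sample-selective step, duplicate prevention, and skipping replay for low-loss new samples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the buffer is modeled as a plain dictionary from class name to sample IDs, and `replay_budget`, `retrieve`, `class_weights`, and `loss_threshold` are hypothetical names. Uniform sampling within a class stands in for whatever sample-selective primitive is actually used.

```python
import random


def replay_budget(new_losses, loss_threshold):
    """Hypothetical rule: request one replay sample per new sample whose
    loss is at or above the threshold; low-loss samples get no replay."""
    return sum(1 for loss in new_losses if loss >= loss_threshold)


def retrieve(buffer, class_weights, n, already_replayed, rng):
    """Pick up to n samples: first a class (class-selective step), then a
    sample within that class (sample-selective step, uniform here).

    buffer: dict mapping class name -> list of sample IDs. Classes may
        overlap, since one image can contain several classes.
    already_replayed: set of sample IDs, updated in place so that no
        sample is replayed twice.
    """
    classes = list(buffer)
    weights = [class_weights[c] for c in classes]
    picked = []
    attempts = 0
    # Bound the loop so an exhausted buffer cannot spin forever.
    while len(picked) < n and attempts < 100 * max(n, 1):
        attempts += 1
        cls = rng.choices(classes, weights=weights, k=1)[0]
        candidates = [s for s in buffer[cls] if s not in already_replayed]
        if not candidates:
            continue  # every sample of this class was already replayed
        sample = rng.choice(candidates)
        already_replayed.add(sample)
        picked.append(sample)
    return picked


rng = random.Random(0)
buffer = {"cat": ["img1", "img2", "img3"], "dog": ["img3", "img4"]}
seen = set()
budget = replay_budget([0.9, 0.1, 0.7], loss_threshold=0.5)
batch = retrieve(buffer, {"cat": 1.0, "dog": 1.0}, budget, seen, rng)
```

Keeping the class step and the sample step separate means either one can be swapped for a different primitive without touching the duplicate-tracking logic, which lives entirely in the shared `already_replayed` set.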

Paper Structure

This paper contains 28 sections, 12 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: The problem settings of continual learning research. Top: the original and most common setup, where a dataset is divided into N balanced subsets and the model is trained sequentially on each subset. Middle: another common setup, identical to the first except that the model is pre-trained; forgetting on pre-training tasks is ignored. Bottom: our setup, where a pre-trained model is sequentially trained on out-of-distribution (OOD) datasets and forgetting on pre-training tasks is minimized.
  • Figure 2: An overview of the contents of our replay buffer. Each image is stored with its top-k class embeddings and their corresponding query embeddings. Note that classes are overlapping in terms of the samples they contain, as most samples contain instances of multiple classes.
  • Figure 3: Histogram of normalized entropies for distributions produced by GRASP, followed by the distributions with minimum, median, and maximum entropies. Recall that GRASP produces a distribution over samples within each class.
  • Figure 4: SWIL distributions with minimum (left) and maximum (right) entropies. Recall that SWIL produces a distribution over classes in the replay buffer.
  • Figure 5: Histogram of KNN-SVs across an entire training sequence (left) and histogram of distance ranks for lowest-scoring (most likely to be selected) images in a candidate set across the same sequence (right). A higher distance rank means the image is closer to the evaluation image; each candidate set contained 352 images.
  • ...and 3 more figures
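
The captions for Figures 3 and 4 compare retrieval distributions by their entropy. As a rough sketch of what a normalized entropy measures, the snippet below assumes the common convention of dividing the Shannon entropy by the logarithm of the number of outcomes, so that a uniform distribution scores 1 and a one-hot distribution scores 0. The paper's exact normalization is not given in this section, and `normalized_entropy` is a hypothetical helper name.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of a discrete distribution divided by log(n).

    weights: non-negative numbers, not necessarily summing to 1; they are
    renormalized here. Assumed convention: the result lies in [0, 1].
    """
    n = len(weights)
    total = sum(weights)
    if n < 2 or total <= 0:
        return 0.0  # entropy is trivially zero or undefined
    entropy = 0.0
    for w in weights:
        if w > 0:  # zero-probability outcomes contribute nothing
            p = w / total
            entropy -= p * math.log(p)
    return entropy / math.log(n)


uniform_score = normalized_entropy([0.25, 0.25, 0.25, 0.25])
peaked_score = normalized_entropy([1.0, 0.0, 0.0, 0.0])
skewed_score = normalized_entropy([0.7, 0.1, 0.1, 0.1])
```

Under this convention, a score near 1 means the strategy samples almost uniformly (little selectivity), while a score near 0 means it concentrates on a few classes or samples.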