Forget Forgetting: Continual Learning in a World of Abundant Memory

Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, Sungmin Cha

TL;DR

This work argues that in real-world continual learning, exemplar memory is no longer the primary bottleneck; GPU time dominates, motivating a regime of abundant-but-not-exhaustive memory. It shows that while increased memory reduces forgetting (improving stability), it simultaneously reduces plasticity due to gradient reuse, necessitating cost-efficient interventions. The authors introduce Weight Space Consolidation, a lightweight method that combines rank-based parameter resets with weight averaging to restore plasticity while preserving stability, without storing per-task models. Empirical results across class-incremental image benchmarks and continual instruction tuning for LLMs demonstrate strong accuracy at replay-like cost, a substantial (3–4×) cost reduction compared with expansion-based methods. Altogether, the paper challenges traditional CL assumptions and provides a practical baseline for scalable, cost-efficient continual learning in modern deployments.

Abstract

Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical "middle ground", we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform the state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on both class-incremental learning with image classifiers and continual instruction tuning with large language models, our approach outperforms strong baselines while matching the low computational cost of replay, offering a scalable alternative to expensive full-retraining. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor.
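
To make the two ingredients named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that parameters are ranked by a per-parameter score (hypothetically, the magnitude of change since the previous task), that low-ranked parameters are reset to their previous-task values, and that weight averaging is a running mean over weight snapshots. The names `rank_based_reset`, `update_average`, and `keep_frac`, as well as the scoring rule, are illustrative assumptions; the abstract does not specify them.

```python
from typing import List


def rank_based_reset(current: List[float], previous: List[float],
                     scores: List[float], keep_frac: float) -> List[float]:
    """Keep the top `keep_frac` of parameters by score and reset the rest.

    Assumption: "reset" means reverting to the previous-task value.
    """
    if not (len(current) == len(previous) == len(scores)):
        raise ValueError("all inputs must have the same length")
    n_keep = int(round(keep_frac * len(current)))
    # Parameter indices sorted from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [c if i in keep else p
            for i, (c, p) in enumerate(zip(current, previous))]


def update_average(avg: List[float], weights: List[float],
                   count: int) -> List[float]:
    """Running mean over `count` earlier snapshots plus one new snapshot."""
    return [a + (w - a) / (count + 1) for a, w in zip(avg, weights)]


# Toy example with four scalar "parameters".
previous = [0.0, 0.0, 0.0, 0.0]
current = [1.0, 0.1, -2.0, 0.05]
# Hypothetical score: how far each parameter moved during the new task.
scores = [abs(c - p) for c, p in zip(current, previous)]

reset = rank_based_reset(current, previous, scores, keep_frac=0.5)
averaged = update_average(reset, [3.0, 0.0, 0.0, 0.0], count=1)
print(reset)     # [1.0, 0.0, -2.0, 0.0]
print(averaged)  # [2.0, 0.0, -1.0, 0.0]
```

In this sketch, the reset confines the new-task update to the highest-scoring half of the parameters, and the running mean needs only one extra copy of the weights, which matches the stated goal of avoiding per-task model storage.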

Paper Structure

This paper contains 43 sections, 19 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: Average class-incremental accuracy (y-axis) versus training time (x-axis) under different exemplar memory sizes in 10-task class-incremental learning on CIFAR-100. As memory increases, catastrophic forgetting is mitigated (i.e., accuracy increases), but training time (i.e., computation cost) also grows proportionally. Note that DER, FOSTER, and MEMO are expansion-based methods (shown with an X mark): FOSTER doubles the model size, while DER and MEMO scale with the number of tasks. Compared to these costly methods, Replay and Ours achieve high accuracy at significantly lower cost, with our method offering the highest cost efficiency and closely approaching the cost lower bound (i.e., Replay).
  • Figure 2: Comparison of (a) average new-task accuracy under different exemplar memory sizes and (b) training loss under full memory in class-incremental learning for 10 tasks using CIFAR-100. As memory increases, the model’s ability to adapt to new tasks declines, resulting in reduced accuracy and slower convergence. Notably, in (b), resetting model weights before each task restores plasticity and facilitates training.
  • Figure 3: Comparison of average score and relative VRAM usage (measured in minutes) under different exemplar memory sizes in LLM continual instruction tuning for 8 tasks using TRACE.
  • Figure 4: The impact of exemplar memory size on catastrophic forgetting. Increased memory drastically reduces forgetting between tasks, although some forgetting persists.
  • Figure 5: A comparison of plasticity loss (measured as the average new-task accuracy) across different exemplar memory sizes in the 10-task scenario of CIFAR-100 and ImageNet-100. As memory size increases, models lose their ability to learn new tasks.
  • ...and 7 more figures