SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning

Hugo Hazard, Zafeirios Fountas, Martin A. Benfeghoul, Adnan Oomerjee, Jun Wang, Haitham Bou-Ammar

TL;DR

The paper tackles catastrophic forgetting in continual learning for large language models by decomposing forgetting into selection and integration errors. It introduces Surprise-prioritised Replay (SuRe), which stores the highest-surprise (high negative log-likelihood) sequences, and couples it with a dual-learner framework that merges fast and slow LoRA adapters via an EMA to stabilise integration. Empirical results show state-of-the-art performance in the Large Number of Tasks (LNT) setting and the best overall average across the Standard CL and LNT benchmarks, and ablations show the method stays robust under reduced replay frequency and small buffer sizes. The work positions replay, when guided by surprise and combined with slow-weight consolidation, as a strong, sample-efficient baseline for continual LLM fine-tuning.

Abstract

Continual learning, one's ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address integration, we add a dual-learner design with fast and slow LoRA adapters merged via an exponential moving average (EMA), enabling rapid adaptation while stabilising long-term knowledge. Combining SuRe with the dual learner yields further gains, including improvements of up to +5 accuracy points on LNT over prior SOTA. Ablation studies confirm that our proposed method remains robust under reduced replay frequency and small buffer size, demonstrating both effectiveness and sample efficiency. Taken together, our results establish replay as a strong baseline for continual LLM fine-tuning and demonstrate that surprise-based selection and slow-weight consolidation are complementary components for mitigating catastrophic forgetting.
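The selection rule described in the abstract (rank sequences by negative log-likelihood and keep the most surprising ones) can be illustrated with a small sketch. This is not the authors' implementation: the class name `SurpriseBuffer`, the `per_task` capacity, the helper `sequence_nll`, and the choice of a length-normalised (mean) NLL are all assumptions made for illustration, and the per-token probabilities stand in for whatever the model would actually produce.

```python
import heapq
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a sequence (assumed normalisation)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


class SurpriseBuffer:
    """Hypothetical buffer keeping the `per_task` highest-NLL sequences per task."""

    def __init__(self, per_task: int) -> None:
        self.per_task = per_task
        # One min-heap per task: the root is the least surprising stored item.
        self._heaps: Dict[str, List[Tuple[float, int, str]]] = defaultdict(list)
        self._counter = 0  # insertion index, used only to break NLL ties

    def add(self, task: str, text: str, nll: float) -> None:
        entry = (nll, self._counter, text)
        self._counter += 1
        heap = self._heaps[task]
        if len(heap) < self.per_task:
            heapq.heappush(heap, entry)
        elif nll > heap[0][0]:
            # New sequence is more surprising than the weakest stored one.
            heapq.heapreplace(heap, entry)

    def items(self, task: str) -> List[str]:
        """Stored sequences for a task, most surprising first."""
        return [text for _, _, text in sorted(self._heaps[task], reverse=True)]


buffer = SurpriseBuffer(per_task=2)
buffer.add("task_a", "easy example", sequence_nll([0.9, 0.9]))
buffer.add("task_a", "hard example", sequence_nll([0.1, 0.2]))
buffer.add("task_a", "medium example", sequence_nll([0.5, 0.5]))
print(buffer.items("task_a"))
```

A per-task min-heap keeps each insertion at O(log k) for capacity k, which matters when every training sequence is scored. Whether surprise is measured before or after training on a task is a design choice this sketch does not take a position on.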

Paper Structure

This paper contains 39 sections, 3 theorems, 21 equations, 5 figures, 11 tables, 1 algorithm.

Key Result

Lemma 1

For all $\theta$ in the local region,

Figures (5)

  • Figure 1: (1) During training, the base and slow LoRA weights are frozen, while the fast LoRA is updated on current samples plus replayed examples from the surprise buffer. The buffer is updated to retain the most surprising samples per task. (2) After each step, the fast and slow LoRA weights are merged via an EMA. (3) At inference, only the base model and the slow learner are used for prediction.
  • Figure 2: Naive sequential fine-tuning (SeqFT) with T5-Large on the Large Number of Tasks (LNT) benchmark.
  • Figure 3: Reservoir Buffer replay with T5-Large on the Large Number of Tasks benchmark. Heatmap shows test task (x-axis) evaluated after each training task (y-axis).
  • Figure 4: Surprise Buffer replay with T5-Large on the Large Number of Tasks benchmark. Same visualisation as Figure 3.
  • Figure 5: Slow learner with Surprise Buffer with T5-Large on the Large Number of Tasks benchmark. Same visualisation as Figure 3.
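The merge step in the Figure 1 caption (fast and slow LoRA weights combined via an EMA after each step) can be sketched as follows. This is a minimal illustration under stated assumptions: the update form `slow = beta * slow + (1 - beta) * fast`, the decay name `beta`, the function name `ema_merge`, and the plain-list weight representation are hypothetical choices, not details taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values (assumed layout)


def ema_merge(slow: Weights, fast: Weights, beta: float) -> Weights:
    """Return updated slow weights: beta * slow + (1 - beta) * fast, per parameter."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return {
        name: [beta * s + (1.0 - beta) * f for s, f in zip(slow[name], fast[name])]
        for name in slow
    }


# Toy adapter with one parameter tensor; the values are illustrative only.
slow_weights = {"lora_A": [0.0, 1.0]}
fast_weights = {"lora_A": [1.0, 1.0]}

# One merge after a training step; a large beta makes the slow learner move slowly.
slow_weights = ema_merge(slow_weights, fast_weights, beta=0.9)
print(slow_weights)
```

With `beta` close to 1 the slow adapter changes only a little per step, which is the intuition behind using it alone at inference: it averages over many fast updates instead of tracking the most recent task.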

Theorems & Definitions (3)

  • Lemma 1: Selection mismatch via IPM
  • Lemma 2: EMA reduces integration variance
  • Theorem 1: Additive bound; complementary controls