Finding Structure in Continual Learning

Pourya Shamsolmoali; Masoumeh Zareapoor

TL;DR

This work tackles the stability-plasticity trade-off in continual learning by reframing optimization through Douglas-Rachford Splitting (DRS), decoupling plasticity (task-fitting) from stability (prior alignment). It employs a Bayesian latent space with posterior-to-prior propagation and a Rényi-divergence penalty to guide learning without replay buffers, proving convergence to stationary points of the composite objective $F=f+g$. The proposed algorithm alternates proximal steps for the task-fitting and prior-alignment terms and uses a relaxed update to fuse their outputs, ensuring interference between updates diminishes over time. Empirically, the method surpasses state-of-the-art baselines on diverse benchmarks, achieving high accuracy, low forgetting on disjoint tasks, and strong forward transfer on joint tasks, all with replay-free operation.
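
To make the Rényi-divergence penalty concrete, the sketch below evaluates the standard closed-form Rényi divergence of order $\alpha$ between two one-dimensional Gaussians. It is an illustration only: the function names, the 1-D restriction, and the argument order are assumptions made here, not the paper's implementation, which operates on a learned latent space.

```python
import math


def kl_gauss(mu_q, s_q, mu_p, s_p):
    """KL( N(mu_q, s_q^2) || N(mu_p, s_p^2) ) for 1-D Gaussians."""
    return math.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p) ** 2) / (2 * s_p**2) - 0.5


def renyi_gauss(mu_q, s_q, mu_p, s_p, alpha):
    """Renyi divergence D_alpha(q || p) between 1-D Gaussians (hypothetical helper)."""
    if abs(alpha - 1.0) < 1e-12:
        # The alpha -> 1 limit of the Renyi divergence is the KL divergence.
        return kl_gauss(mu_q, s_q, mu_p, s_p)
    # Interpolated variance; it must stay positive for the divergence to be finite.
    var_mix = alpha * s_p**2 + (1.0 - alpha) * s_q**2
    if var_mix <= 0.0:
        raise ValueError("divergence is infinite for this alpha and these variances")
    quad = alpha * (mu_q - mu_p) ** 2 / (2.0 * var_mix)
    log_term = (
        0.5 * math.log(var_mix)
        - (1.0 - alpha) * math.log(s_q)
        - alpha * math.log(s_p)
    ) / (1.0 - alpha)
    return quad + log_term
```

For fixed distributions the value is non-decreasing in $\alpha$, so a larger order penalizes a mismatch between posterior and prior at least as strongly as KL ($\alpha = 1$) does.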

Abstract

Learning from a stream of tasks usually pits plasticity against stability: acquiring new knowledge often causes catastrophic forgetting of past information. Most methods address this by summing competing loss terms, creating gradient conflicts that are managed with complex and often inefficient strategies such as external memory replay or parameter regularization. We propose a reformulation of the continual learning objective using Douglas-Rachford Splitting (DRS). This reframes the learning process not as a direct trade-off, but as a negotiation between two decoupled objectives: one promoting plasticity for new tasks and the other enforcing stability of old knowledge. By iteratively finding a consensus through their proximal operators, DRS provides a more principled and stable learning dynamic. Our approach achieves an efficient balance between stability and plasticity without the need for auxiliary modules or complex add-ons, providing a simpler yet more powerful paradigm for continual learning systems.
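
The consensus-through-proximal-operators idea can be sketched on a toy scalar problem. The example below runs a textbook Douglas-Rachford iteration on $F = f + g$ with $f(x) = \tfrac12 (x - \text{target})^2$ standing in for the plasticity term and $g(x) = \tfrac{\rho}{2}(x - \text{anchor})^2$ standing in for the stability term. The quadratic objectives, the function names, the step order, and the parameter values are all assumptions chosen for illustration; the paper's algorithm works on latent posteriors with a divergence penalty rather than on scalars.

```python
def prox_task(v, gamma, target):
    """Proximal operator of gamma * f, with f(x) = 0.5 * (x - target)^2."""
    return (v + gamma * target) / (1.0 + gamma)


def prox_prior(v, gamma, anchor, rho):
    """Proximal operator of gamma * g, with g(x) = 0.5 * rho * (x - anchor)^2."""
    return (v + gamma * rho * anchor) / (1.0 + gamma * rho)


def douglas_rachford(target, anchor, rho=1.0, gamma=1.0, relax=1.0, iters=200):
    """Minimize f + g by Douglas-Rachford splitting (toy scalar sketch)."""
    z = 0.0  # auxiliary variable carried between iterations
    for _ in range(iters):
        x = prox_prior(z, gamma, anchor, rho)        # stability step
        y = prox_task(2.0 * x - z, gamma, target)    # plasticity step at the reflected point
        z = z + relax * (y - x)                      # relaxed update fusing both outputs
    return prox_prior(z, gamma, anchor, rho)


# The minimizer of f + g is (target + rho * anchor) / (1 + rho).
x_star = douglas_rachford(target=2.0, anchor=0.0, rho=1.0)
```

Each objective is only ever touched through its own proximal step, so the two terms never contribute to a single summed gradient. The gap `y - x` shrinks to zero as the iteration converges, which is the toy analogue of interference between the two updates diminishing over time.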

Paper Structure (26 sections, 6 theorems, 16 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Our Approach
Problem Overview
DRS-based Continual Learner
Discussion
Experiments
Results Against Catastrophic Forgetting and Loss of Plasticity
Forgetting Analysis
Ablation study
Conclusion
Appendix
Posterior and Prior construction
Hyperparameters
Theoretical Analysis
...and 11 more sections

Key Result

Proposition 3.1

Let posterior $q(z)\!=\!\mathcal{N}(z|\mu_q,\Sigma_q)$ and prior $p(z)\!=\!\mathcal{N}(z|\mu_p, \Sigma_p)$ be Gaussian distributions. Consider the proximal operator problem, $q^\star\!=\!\arg\min_q[ D(q \parallel p)+\frac{1}{2\gamma} D(q \parallel v)]$, where $v(z)\!=\!\mathcal{N}(z|\mu_v,\Sigma_v)$
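
The statement above is cut off in this excerpt and does not say which divergence $D$ denotes, so the following is not the proposition's result. As a hedged illustration of what such a proximal problem looks like, the sketch assumes $D$ is the KL divergence and restricts to one-dimensional Gaussians. Under those assumptions the minimizer is the normalized geometric mixture $q^\star \propto p^{w} v^{1-w}$ with $w = 1/(1+\lambda)$ and $\lambda = 1/(2\gamma)$, which is again Gaussian. All function names are hypothetical.

```python
import math


def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for 1-D Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5


def kl_prox_gauss(mu_p, s_p, mu_v, s_v, gamma):
    """Minimizer of KL(q||p) + (1/(2*gamma)) * KL(q||v) over q (KL case, assumed)."""
    lam = 1.0 / (2.0 * gamma)
    w = 1.0 / (1.0 + lam)                      # weight on the prior p
    prec = w / s_p**2 + (1.0 - w) / s_v**2     # precisions combine linearly
    mean = (w * mu_p / s_p**2 + (1.0 - w) * mu_v / s_v**2) / prec
    return mean, math.sqrt(1.0 / prec)


def prox_objective(m, s, mu_p, s_p, mu_v, s_v, gamma):
    """Value of the proximal objective at q = N(m, s^2)."""
    return kl_gauss(m, s, mu_p, s_p) + kl_gauss(m, s, mu_v, s_v) / (2.0 * gamma)


m_star, s_star = kl_prox_gauss(mu_p=0.0, s_p=1.0, mu_v=2.0, s_v=0.5, gamma=0.5)
```

A small $\gamma$ makes $\lambda$ large and pulls $q^\star$ toward the input $v$, while a large $\gamma$ pulls it toward the prior $p$. This is the usual role of the step size in a proximal operator.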

Figures (6)

Figure 1: The stability-plasticity dilemma in continual learning on EMNIST. (a) Trade-off between online average accuracy and plasticity across various methods. Methods closer to the top-right corner better balance learning new tasks against forgetting old ones. (b) Catastrophic forgetting: average accuracy over seen tasks vs. task index. Forgetful methods drop or remain low; a successful one maintains a consistently high curve throughout training. (c) Loss of plasticity: an ideal learner should maintain high, stable performance on new tasks regardless of how many it has seen before. A downward-sloping curve on this plot indicates that the model is losing its plasticity.
Figure 2: Addressing catastrophic forgetting with Douglas-Rachford Splitting (DRS). (a) SGD optimizes only for the current/new task, causing the latent posterior $q_\phi$ to drift toward the new distribution, leading to forgetting of past knowledge $(\theta_{\mathrm{past}})$. (b) DRS constrains the posterior within a region that supports both old and new task distributions, preserving prior knowledge. $\theta_{\mathrm{past}}, \theta_{\mathrm{new}}, \theta_{\mathrm{split}}$ represent the past, new, and balanced posteriors. (c) Our optimization loop: task-specific learning, retention, and a relaxation step that balances both forces. This structure avoids gradient interference and supports continual learning across long task sequences. Our algorithm is given in Algorithm 1.
Figure 3: Forgetting analysis over 100 tasks on the CASIA classification benchmark. Each subplot summarizes average forgetting across 20-task intervals, and the final plot shows the average forgetting across all 100 tasks. We compare $\mathcal{L}_{LH}$ (likelihood only; no stability term), $\mathcal{L}_{KL}$ (KL regularizer), $\mathcal{L}_{RD}$ (Rényi divergence, but standard gradient updates), and Ours (DRS + Rényi).
Figure 4: Ablation study on the CIFAR100 (20 tasks) benchmark. (a) Average accuracy across different values of the divergence parameter $\alpha$. The best result is achieved at $\alpha = 2.0$, reaching an average accuracy of $\approx$77%, while the lowest performance occurs at $\alpha = 0.0$, dropping to $\approx$72%. (b) Relative training time for various methods using an RTX-3090 GPU. Our DRS-based continual learner achieves competitive runtime while outperforming all baselines in accuracy. SGD corresponds to standard optimization without DRS (i.e., direct minimization of Eq. \ref{loss}). The variant without latent sampling ($z'$) reduces compute time by 9%, but results in lower final accuracy. (c) Performance of KL-divergence (baselines) vs. D-divergence (our model). Using KL ($\alpha=1$) degrades performance, while our model $(\alpha=2)$ consistently achieves higher accuracy and stability.
Figure 5: This figure compares forgetting behavior across 100 tasks on ImageNet for different approaches. Each subplot shows forgetting for a block of 10 tasks (e.g., Tasks 1–10, 11–20, ..., 91–100), with the final subplot aggregating all 100 tasks.
...and 1 more figures

Theorems & Definitions (12)

Proposition 3.1
proof
Proposition 3.2
proof
Proposition A.1: Convergence to a stationary point
proof
Lemma A.1: Interference control
proof
Proposition A.2: Controlling stability with the $\alpha$ parameter
proof
...and 2 more
