Primal Dual Continual Learning: Balancing Stability and Plasticity through Adaptive Memory Allocation

Juan Elenter, Navid NaderiAlizadeh, Tara Javidi, Alejandro Ribeiro

TL;DR

This paper reframes continual learning as a constrained optimization problem to prevent forgetting past tasks while learning new ones. It employs Lagrangian duality to derive a primal-dual algorithm (PDCL) that uses dual variables to adapt replay-buffer usage, both across tasks (buffer partition) and within tasks (sample selection). The approach provides theoretical connections between dual variables and the stability-plasticity trade-off, along with a practical mechanism to allocate memory where it matters most. Empirical results across image, audio, and medical benchmarks show that duality-driven buffer management improves accuracy and reduces forgetting, though benefits decline with very small memory budgets or insufficient model capacity. The work offers a principled, scalable path toward adaptive memory management in continual learning and highlights future directions for large-model settings and constraint design.

Abstract

Continual learning is inherently a constrained learning problem. The goal is to learn a predictor under a no-forgetting requirement. Although several prior studies formulate it as such, they do not solve the constrained problem explicitly. In this work, we show that it is both possible and beneficial to undertake the constrained optimization problem directly. To do this, we leverage recent results in constrained learning through Lagrangian duality. We focus on memory-based methods, where a small subset of samples from previous tasks can be stored in a replay buffer. In this setting, we analyze two versions of the continual learning problem: a coarse approach with constraints at the task level and a fine approach with constraints at the sample level. We show that dual variables indicate the sensitivity of the optimal value of the continual learning problem with respect to constraint perturbations. We then leverage this result to partition the buffer in the coarse approach, allocating more resources to harder tasks, and to populate the buffer in the fine approach, including only impactful samples. We derive a deviation bound on dual variables as sensitivity indicators, and empirically corroborate this result in diverse continual learning benchmarks. We also discuss the limitations of these methods with respect to the amount of memory available and the expressiveness of the parametrization.
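The coarse (task-level) idea described above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' implementation: each task loss is a hypothetical quadratic $\|\theta - c_k\|^2$, the function names `primal_dual` and `partition_buffer`, the step sizes, and the proportional allocation rule are all choices made for this example. The loop alternates a gradient step on the Lagrangian in $\theta$ with a projected ascent step on the dual variables, $\lambda_k \leftarrow [\lambda_k + \eta(\ell_k(\theta) - \epsilon)]_+$, and then splits the buffer in proportion to $\lambda$ while enforcing a minimum partition size.

```python
import numpy as np

def primal_dual(centers, eps, steps=2000, lr_theta=0.05, lr_lambda=0.05):
    """Toy primal-dual loop (illustrative assumption, not the paper's code).

    The last row of `centers` is the current task; earlier rows are past
    tasks with loss_k(theta) = ||theta - c_k||^2, constrained to be <= eps.
    """
    centers = np.asarray(centers, dtype=float)
    past, current = centers[:-1], centers[-1]
    theta = np.zeros_like(current)
    lam = np.zeros(len(past))
    for _ in range(steps):
        # Primal step: gradient descent on the Lagrangian with respect to theta.
        grad = 2 * (theta - current) + (2 * lam[:, None] * (theta - past)).sum(axis=0)
        theta = theta - lr_theta * grad
        # Dual step: ascent on the constraint slack, projected onto lam >= 0.
        slack = ((theta - past) ** 2).sum(axis=1) - eps
        lam = np.maximum(lam + lr_lambda * slack, 0.0)
    return theta, lam

def partition_buffer(lam, buffer_size, min_per_task=1):
    """Split `buffer_size` slots across past tasks in proportion to `lam`,
    giving every task at least `min_per_task` slots (hypothetical rule)."""
    lam = np.asarray(lam, dtype=float)
    n = len(lam)
    free = buffer_size - n * min_per_task
    if free < 0:
        raise ValueError("buffer too small for the minimum partition size")
    weights = lam / lam.sum() if lam.sum() > 0 else np.full(n, 1.0 / n)
    sizes = min_per_task + np.floor(free * weights).astype(int)
    # Hand out slots lost to rounding, largest dual variables first.
    leftover = int(buffer_size - sizes.sum())
    for k in np.argsort(-lam)[:leftover]:
        sizes[k] += 1
    return sizes

# Two past tasks (centers at x=0.5 and x=2) and a current task at the origin.
theta, lam = primal_dual([[0.5, 0.0], [2.0, 0.0], [0.0, 0.0]], eps=1.0)
sizes = partition_buffer(lam, buffer_size=10, min_per_task=2)
```

In this toy setup the first past task is already satisfied at the current task's optimum, so its dual variable stays at zero, while the second constraint is active and receives a positive dual variable and therefore most of the buffer.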

Paper Structure

This paper contains 27 sections, 7 theorems, 74 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Let $m_k$ be the unconstrained minimum associated with task $k$ and let $M$ be the Lipschitz constant of the loss $\ell(\cdot, y)$. Under Assumption \ref{ass:task_sim}, there exists $\theta \in \Theta$ such that,

Figures (7)

  • Figure 1: Diagram of a $\nu$-universal parametrization $\mathcal{F}_{\Theta}$ of $\mathcal{F}$.
  • Figure 2: Left: Dual variables indicate the sensitivity of the performance on the current task with respect to the no-forgetting requirement enforced on past tasks (Theorem \ref{theo:omegasubdiff}). Right: Impact of minimum enforced partition size in speech classification (see Section \ref{sec:abp}).
  • Figure 3: Leveraging dual variables (PDCL$_0$) to adaptively weight the replay losses provides an improvement over fixed regularization weights (ER Ring and Reservoir). Partitioning the buffer non-uniformly (PDCL) according to the task difficulty measured by $\boldsymbol{\lambda}$ improves over uniform (PDCL$_0$, ER Ring, AGEM) partitions. Gradient projections (AGEM) tend to perform worse than replays.
  • Figure 4: Left: Evolution of non-uniform partitions obtained by (PDCL) in OrganA, and its distance from a uniform one. Center: If $\epsilon$ is set too loose, we allow larger forgetting. Conversely, if $\epsilon$ is too tight, (\ref{Pt}) can become harder to solve due to the reduction of its feasible set. Right: For a tight $\epsilon$, we violate the constraint and the associated dual variable can grow indefinitely. For a loose $\epsilon$, $\lambda_0\to0$, indicating that the performance on task 0 is not limiting learning the current task.
  • Figure 5: Class clusters with $\lambda_{x, y}$ indicated by marker size. Large dual variables accumulate at the task decision boundary and at cluster edges.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Proposition 2
  • Lemma 3
  • Definition 1