Temporal-Difference Variational Continual Learning

Luckeciano C. Melo, Alessandro Abate, Yarin Gal

TL;DR

This work tackles catastrophic forgetting in Bayesian continual learning by addressing compounding posterior-approximation errors in Variational Continual Learning (VCL). It introduces Temporal-Difference Variational Continual Learning (TD-VCL), a family of objectives that bootstraps posterior updates using multiple past posteriors and draws a principled link to Temporal-Difference methods. The framework includes two concrete instantiations, n-Step KL Regularization and TD($\lambda$)-VCL, which span a spectrum from vanilla VCL to multi-step KL regularization and balance plasticity against memory stability. Empirical results on challenging CL benchmarks show that the TD-VCL objectives outperform strong Bayesian baselines, with robustness to boundary assumptions and favorable per-task performance, highlighting practical impact for uncertainty-aware continual learning systems. Overall, the approach unites variational inference and TD bootstrapping to yield scalable, memory-stable continual learners with improved resistance to forgetting in non-stationary environments.

Abstract

Machine Learning models in real-world applications must continuously learn new tasks to adapt to shifts in the data-generating distribution. Yet, for Continual Learning (CL), models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. In the Bayesian CL literature, variational methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution while constraining it to stay close to its previous estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. Experiments on challenging CL benchmarks show that our approach effectively mitigates Catastrophic Forgetting, outperforming strong Variational CL methods.

Paper Structure

This paper contains 35 sections, 7 theorems, 35 equations, 11 figures, 10 tables.

Key Result

Proposition 4.0

The standard KL minimization objective in Variational Continual Learning (Equation eq2) is equivalently represented as the following objective, where $n \in \mathbb{N}_{0}$ is a hyperparameter:
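
A sketch of the shape such an objective can take, reconstructed from the recursion described in the abstract rather than quoted from the paper: unroll the Bayesian update over the last $i+1$ tasks, substitute the stored estimate $q_{t-i-1}$ for the corresponding true posterior, and average over $i = 0, \dots, n-1$. The indexing, the uniform $1/n$ weights, and the restriction to $n \geq 1$ are assumptions of this sketch.

```latex
% Vanilla VCL step (the n = 1 case): regularize only toward q_{t-1}.
q_t = \arg\max_{q \in \mathcal{Q}} \;
  \mathbb{E}_{q(\boldsymbol{\theta})}\big[\log p(\mathcal{D}_t \mid \boldsymbol{\theta})\big]
  - \mathcal{D}_{KL}\big(q(\boldsymbol{\theta}) \,\|\, q_{t-1}(\boldsymbol{\theta})\big)

% Unrolling i+1 steps, with q_{t-i-1} standing in for p(\theta | D_{1:t-i-1}):
p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t})
  \;\propto\; p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t-i-1})
     \prod_{j=0}^{i} p(\mathcal{D}_{t-j} \mid \boldsymbol{\theta})
  \;\approx\; q_{t-i-1}(\boldsymbol{\theta})
     \prod_{j=0}^{i} p(\mathcal{D}_{t-j} \mid \boldsymbol{\theta})

% Averaging the resulting objectives over i = 0, ..., n-1.
% D_{t-i} appears in the unrollings of depth i, ..., n-1, hence the weight (n-i)/n.
q_t = \arg\max_{q \in \mathcal{Q}} \;
  \mathbb{E}_{q(\boldsymbol{\theta})}\Big[\sum_{i=0}^{n-1} \frac{n-i}{n}
     \log p(\mathcal{D}_{t-i} \mid \boldsymbol{\theta})\Big]
  - \frac{1}{n} \sum_{i=0}^{n-1}
     \mathcal{D}_{KL}\big(q(\boldsymbol{\theta}) \,\|\, q_{t-i-1}(\boldsymbol{\theta})\big)
```

In this form, setting $n = 1$ recovers the single-step VCL objective, while larger $n$ spreads the regularization across several past estimates, which is the behavior Figure 2 illustrates.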

Figures (11)

  • Figure 1: Average accuracy across observed tasks in the PermutedMNIST-Hard benchmark. The TD-VCL approach, proposed in this work, leads to a substantial improvement against standard VCL and non-variational approaches.
  • Figure 2: An intuitive illustration of how TD-VCL functions in comparison to vanilla VCL. At each timestep $t$, a new task dataset $\mathcal{D}_{t}$ arrives. Both methods aim to learn variational parameters $q_{t}(\boldsymbol{\theta})$ over a family of distributions $\mathcal{Q}$ that approximates the true posterior $p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t})$ via minimizing the KL divergence $\mathcal{D}_{KL}(q_{t}(\boldsymbol{\theta}) \mid \mid p(\boldsymbol{\theta} \mid \mathcal{D}_{1:t}))$. VCL optimization (left) is only constrained by the most recent posterior, which compounds approximation errors from previous estimations and potentially deviates far from the true posterior. TD-VCL (right) is regularized by a sequence of past estimations, alleviating the impact of compounded errors.
  • Figure 3: Per-task performance (accuracy) over time in the PermutedMNIST-Hard benchmark. Each plot represents the accuracy of one task (identified in the plot title) while the number of observed tasks increases. We highlight a stronger effect of Catastrophic Forgetting on earlier tasks for the baselines, while TD-VCL objectives are noticeably more robust to this phenomenon.
  • Figure 4: A Replay Buffer analysis on the PermutedMNIST benchmark. Each curve represents a model re-trained on a buffer composed of "$T$" previous tasks, "$B$" examples of each. Online MLE only considers the current task. Allowing "unlimited" access to previous task data trivializes the CL setting, and a simple MLE baseline is enough to attain strong results. Nevertheless, as we restrict the replay buffer in size and number of tasks, the benchmark becomes substantially more challenging and shows signs of Catastrophic Forgetting.
  • Figure 5: SplitMNIST results. The first five plots show results per task, and the last one is an average across tasks. As a consequence of multi-head networks simplifying the Continual Learning challenge, all methods attain high accuracy. In particular, variational methods attain accuracies between 97% and 98%. In contrast, SplitMNIST-Hard (Figure \ref{fig:splitmnistsinglehead}) provides a considerably more challenging CL benchmark.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Proposition 4.0
  • Proposition 4.0
  • Definition 4.0
  • Proposition 4.0
  • Proposition A.0
  • proof
  • Lemma B.0
  • proof
  • Proposition B.0
  • proof
  • ...and 2 more