Continual Learning With Quasi-Newton Methods

Steven Vander Eeckt, Hugo Van hamme

TL;DR

This work addresses catastrophic forgetting in sequential task learning by extending EWC with Sampled Quasi-Newton (SQN) Hessian approximations, moving beyond the diagonal of the Fisher Information Matrix to capture interactions between parameters. CSQN integrates SQN-based Hessian updates into the EWC framework and introduces memory-reduction variants (CT, BTREE, MRT) to scale to many tasks while preserving performance. Across Rotated MNIST, Split CIFAR-10/100, Split TinyImageNet, and Vision Datasets, CSQN consistently outperforms EWC and many baselines, reducing EWC's forgetting by about 50% and improving its performance by roughly 8% on average, though KF remains a strong competitor on some tasks. The methods are architecture-agnostic and straightforward to implement, which makes CSQN a practical option for continual learning; reducing its memory overhead further is left as a direction for future work.

Abstract

Catastrophic forgetting remains a major challenge when neural networks learn tasks sequentially. Elastic Weight Consolidation (EWC) attempts to address this problem by introducing a Bayesian-inspired regularization loss to preserve knowledge of previously learned tasks. However, EWC relies on a Laplace approximation where the Hessian is simplified to the diagonal of the Fisher information matrix, assuming uncorrelated model parameters. This overly simplistic assumption often leads to poor Hessian estimates, limiting its effectiveness. To overcome this limitation, we introduce Continual Learning with Sampled Quasi-Newton (CSQN), which leverages Quasi-Newton methods to compute more accurate Hessian approximations. CSQN captures parameter interactions beyond the diagonal without requiring architecture-specific modifications, making it applicable across diverse tasks and architectures. Experimental results across four benchmarks demonstrate that CSQN consistently outperforms EWC and other state-of-the-art baselines, including rehearsal-based methods. CSQN reduces EWC's forgetting by 50 percent and improves its performance by 8 percent on average. Notably, CSQN achieves superior results on three out of four benchmarks, including the most challenging scenarios, highlighting its potential as a robust solution for continual learning.
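
To make the abstract's point about the diagonal assumption concrete, the following minimal sketch compares a diagonal quadratic penalty (the form EWC uses, with the Fisher diagonal as weights) against a full quadratic penalty that keeps off-diagonal terms. The function names, the toy 2x2 matrix, and the regularization weight `lam` are assumptions made for illustration. The dense matrix is only a stand-in for any non-diagonal Hessian approximation; it is not the CSQN update itself.

```python
import numpy as np

def diagonal_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """EWC-style penalty: 0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2."""
    d = theta - theta_star
    return 0.5 * lam * float(np.sum(fisher_diag * d * d))

def full_quadratic_penalty(theta, theta_star, hessian, lam=1.0):
    """Quadratic penalty with a full matrix: 0.5 * lam * d^T H d."""
    d = theta - theta_star
    return 0.5 * lam * float(d @ hessian @ d)

# Toy example (hypothetical values): two strongly correlated parameters.
theta_star = np.array([0.0, 0.0])          # parameters after the old task
theta = np.array([1.0, -1.0])              # candidate parameters for the new task
hessian = np.array([[1.0, 0.9],
                    [0.9, 1.0]])           # off-diagonal 0.9 encodes the correlation

diag_value = diagonal_penalty(theta, theta_star, np.diag(hessian))
full_value = full_quadratic_penalty(theta, theta_star, hessian)
print(f"diagonal penalty: {diag_value:.3f}")
print(f"full penalty:     {full_value:.3f}")
```

In this toy case the diagonal penalty is 1.0 while the full penalty is about 0.1: the move along (1, -1) barely changes the old-task loss, but a diagonal approximation cannot see that and penalizes it ten times more heavily. Moves along (1, 1) show the opposite error, where the diagonal form underestimates the cost.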

Paper Structure

This paper contains 32 sections, 12 equations, 7 figures, 5 tables, 3 algorithms.

Figures (7)

  • Figure 1: Illustration of the binary tree of tasks for the BTREE method with $T=4$. After learning task 2, the $\bm{Z}$ matrices of tasks 1 and 2 are concatenated and reduced with SVD to form a single $\bm{Z}$ matrix of size $(2)M$. Following task 3, no reduction is applied. After task 4 is learned, the $\bm{Z}$ matrices of tasks 3 and 4 are concatenated and reduced. In a second step, the $\bm{Z}$ matrix from tasks 1 and 2, and the $\bm{Z}$ matrix from tasks 3 and 4 are further reduced into a single $\bm{Z}$ matrix of size $(2)M$.
  • Figure 2: Average accuracy in $\%$ after each task for Rotated MNIST.
  • Figure 3: Average accuracy in $\%$ after each task for Split CIFAR-10/100 experiments.
  • Figure 4: Average accuracy in $\%$ after each task for Split TinyImageNet.
  • Figure 5: Average accuracy in $\%$ after each task for Vision Datasets experiments.
  • ...and 2 more figures
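
The merge schedule described in the Figure 1 caption can be sketched as a stack of per-task matrices that are merged whenever the two most recent entries sit at the same tree level. This is an illustrative sketch only: the layout of $\bm{Z}$ (parameters as rows, stored vectors as columns), the column-wise concatenation, the truncated-SVD reduction rule, and the names `svd_reduce`, `btree_update`, and `keep` are all assumptions, and the number of retained columns is left as a free parameter.

```python
import numpy as np

def svd_reduce(z_cat, keep):
    """Truncated SVD: return a matrix with `keep` columns such that
    z_red @ z_red.T approximates z_cat @ z_cat.T (assumed reduction rule)."""
    u, s, _ = np.linalg.svd(z_cat, full_matrices=False)
    return u[:, :keep] * s[:keep]

def btree_update(stack, z_new, task_id, keep):
    """Push the new task's Z as a leaf (level 0), then merge the top two
    entries for as long as they share the same level."""
    stack.append((0, [task_id], z_new))
    while len(stack) >= 2 and stack[-1][0] == stack[-2][0]:
        level, tasks_b, z_b = stack.pop()
        _, tasks_a, z_a = stack.pop()
        z_cat = np.concatenate([z_a, z_b], axis=1)   # concatenate columns
        stack.append((level + 1, tasks_a + tasks_b, svd_reduce(z_cat, keep)))
    return stack

# Hypothetical sizes: 20 parameters, 3 stored columns per task, T = 4 tasks.
rng = np.random.default_rng(0)
stack = []
for task_id in range(1, 5):
    z_task = rng.standard_normal((20, 3))
    btree_update(stack, z_task, task_id, keep=3)
    groups = [tasks for _, tasks, _ in stack]
    print(f"after task {task_id}: stored groups = {groups}")
```

With $T=4$ this reproduces the sequence in the caption: tasks 1 and 2 are merged after task 2, nothing is reduced after task 3, and after task 4 the pair (3, 4) is merged and then combined with the (1, 2) matrix, so at most two matrices are held between merges in this run.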