Generalisation Guarantees for Continual Learning with Orthogonal Gradient Descent

Mehdi Abbana Bennani, Thang Doan, Masashi Sugiyama

TL;DR

This work develops a theoretical NTK-based framework for Continual Learning and analyzes Orthogonal Gradient Descent (OGD) within that regime. It derives a closed-form, recursive kernel regression description of the model across tasks, establishes linear convergence under suitable learning-rate conditions, and proves a no-forgetting property for OGD with infinite memory. The authors also provide generalisation bounds showing that task similarity, measured through the NTK, governs learning performance, and they demonstrate that variation of the NTK during training can weaken OGD's protection against forgetting, which motivates the OGD+ variant. Empirical results on standard benchmarks support the theory: over-parameterization and controlled NTK drift improve robustness to forgetting, while curriculum-like task sequences can influence generalisation through task similarity. Overall, the paper links Continual Learning, kernel methods, and curriculum design to offer provable learning guarantees and practical insights for managing forgetting in sequential task settings.

Abstract

In Continual Learning settings, deep neural networks are prone to Catastrophic Forgetting. Orthogonal Gradient Descent was proposed to tackle the challenge. However, no theoretical guarantees have been proven yet. We present a theoretical framework to study Continual Learning algorithms in the Neural Tangent Kernel regime. This framework comprises a closed-form expression of the model across tasks and proxies for Transfer Learning, generalisation and task similarity. In this framework, we prove that OGD is robust to Catastrophic Forgetting, then derive the first generalisation bound for SGD and OGD for Continual Learning. Finally, we study the limits of this framework in practice for OGD and highlight the importance of the Neural Tangent Kernel variation for Continual Learning with OGD.
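
As a rough illustration of the projection step that gives Orthogonal Gradient Descent its name, the sketch below keeps an orthonormal basis of stored gradient directions from earlier tasks and removes those directions from a new task's gradient before it is applied. It is a minimal NumPy sketch, not the authors' implementation: the helper names `extend_basis` and `project_orthogonal`, the random toy vectors, and the tolerance are all assumptions made for illustration.

```python
import numpy as np


def extend_basis(basis, grad, tol=1e-10):
    """Gram-Schmidt step: add the part of `grad` orthogonal to `basis`.

    `basis` is a list of orthonormal vectors and is modified in place.
    (Hypothetical helper, named for this sketch only.)
    """
    residual = np.asarray(grad, dtype=float).copy()
    for b in basis:
        residual -= np.dot(residual, b) * b
    norm = np.linalg.norm(residual)
    if norm > tol:  # skip directions already spanned by the basis
        basis.append(residual / norm)
    return basis


def project_orthogonal(grad, basis):
    """Remove from `grad` its components along every stored direction."""
    projected = np.asarray(grad, dtype=float).copy()
    for b in basis:
        projected -= np.dot(projected, b) * b
    return projected


# Toy setup (assumed): 3 stored gradients from earlier tasks, 8 parameters.
rng = np.random.default_rng(0)
memory_grads = rng.normal(size=(3, 8))

basis = []
for g in memory_grads:
    extend_basis(basis, g)

# Gradient of the new task's loss, projected before the parameter update.
new_task_grad = rng.normal(size=8)
update = project_orthogonal(new_task_grad, basis)

# The projected update has (numerically) zero overlap with stored directions.
overlaps = memory_grads @ update
```

Because `update` is orthogonal to every stored direction, a small step along it leaves the model's outputs on the memorised samples unchanged to first order, which is the intuition behind the no-forgetting result listed below as Theorem 2.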

Paper Structure

This paper contains 114 sections, 15 theorems, 40 equations, 6 figures, 12 tables.

Key Result

Theorem 1

Figures (6)

  • Figure 1: The variation of the train accuracy on the memorised samples from the first task as a function of overparameterization (higher is better). The forgetting decreases with overparameterization, as stated in Theorem \ref{thm:train-unchanged}.
  • Figure 2: The variation of the train accuracy on the memorised samples from each task, after the model was trained on all tasks in sequence (higher is better). We vary the hidden size as a proxy for overparameterization.
  • Figure 3: The variation of the train accuracy on the memorised samples from each task, after the model was trained on all tasks in sequence (higher is better). We vary the hidden size as a proxy for overparameterization.
  • Figure 4: The variation of the train accuracy on the memorised samples from the first task, after the model was trained on all tasks in sequence (higher is better). We vary the memory size per task from 100 to 300.
  • Figure 5: Test accuracy on the 10 first tasks of Rotated MNIST, for SGD, OGD, OGD+ and A-GEM. The y-axis is truncated for clarity. We report the mean and standard deviation over 5 independent runs. The test error is measured for every 250 mini-batch interval.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Theorem 1: Continual Learning as a recursive Kernel Regression
  • Remark 1
  • Corollary 1
  • Remark 2
  • Theorem 2: No-forgetting Continual Learning with OGD
  • Theorem 3: Generalisation of SGD and OGD for Continual Learning
  • Lemma 1
  • Proposition 1
  • Theorem 4
  • Lemma 2
  • ...and 7 more