Natural Gradient Descent for Online Continual Learning

Joe Khawand, David Colliaux

Abstract

Online Continual Learning (OCL) for image classification is a challenging subset of Continual Learning in which images from a stream are classified without assuming that the data are independent and identically distributed (i.i.d.). The primary challenge in this setting is to prevent catastrophic forgetting, where the model's performance on previous tasks deteriorates as it learns new ones. Although various strategies have been proposed to address this issue, achieving rapid convergence remains a significant challenge in the online setting. In this work, we introduce a novel approach to training OCL models that uses the Natural Gradient Descent optimizer with an approximation of the Fisher Information Matrix (FIM) obtained through Kronecker Factored Approximate Curvature (KFAC). This method demonstrates substantial improvements in performance across all OCL methods, particularly when combined with existing OCL tricks, on datasets such as Split CIFAR-100, CORe50, and Split miniImageNet.
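To make the KFAC-based natural gradient step concrete, here is a minimal NumPy sketch for a single fully connected layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `kfac_natural_gradient`, the argument names, and the choice of adding the damping term `eps` to each Kronecker factor separately are all hypothetical. It assumes the standard KFAC form in which the layer's Fisher block is approximated as the Kronecker product of the input second-moment matrix and the output-gradient second-moment matrix.

```python
import numpy as np


def kfac_natural_gradient(acts, grads_out, eps=1e-2):
    """Sketch of a damped KFAC natural gradient for one linear layer y = W a.

    acts:      (batch, n_in)  layer inputs a
    grads_out: (batch, n_out) per-sample gradients of the loss w.r.t. y
    eps:       damping added to each Kronecker factor (an assumption of
               this sketch; other damping schemes exist)
    Returns the plain gradient and the preconditioned (natural) gradient,
    both of shape (n_out, n_in).
    """
    batch = acts.shape[0]

    # Ordinary mini-batch gradient w.r.t. W.
    grad_w = grads_out.T @ acts / batch

    # Kronecker factors: F is approximated by kron(S, A).
    A = acts.T @ acts / batch            # (n_in, n_in) input second moment
    S = grads_out.T @ grads_out / batch  # (n_out, n_out) output-gradient second moment

    A_damped = A + eps * np.eye(A.shape[0])
    S_damped = S + eps * np.eye(S.shape[0])

    # Natural gradient: S_damped^{-1} @ grad_w @ A_damped^{-1},
    # computed with linear solves instead of explicit inverses.
    left = np.linalg.solve(S_damped, grad_w)
    nat_grad = np.linalg.solve(A_damped, left.T).T  # valid because A_damped is symmetric
    return grad_w, nat_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(32, 4))   # synthetic activations
    g = rng.normal(size=(32, 3))   # synthetic output gradients
    grad, nat = kfac_natural_gradient(a, g, eps=0.1)
    print(grad.shape, nat.shape)
```

The point of the factorization is cost: the two solves involve only an `n_out x n_out` and an `n_in x n_in` matrix, whereas the full Fisher block for the layer would be `(n_out * n_in) x (n_out * n_in)`.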

Paper Structure (20 sections, 8 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Average End Accuracy on Split-CIFAR-100 (Krizhevsky, 2009) for 3 buffer sizes (1k, 5k, 10k) using ER (Rolnick et al., 2019) in combination with OCL tricks (Mai et al., 2022).
  • Figure 2: Average End Accuracy by trick, averaged over all datasets in the OCI setting, as a function of the damping parameter $\epsilon$.

Theorems & Definitions (2)

  • Definition 1
  • Definition 2