Gradient Projection Memory for Continual Learning

Gobinda Saha, Isha Garg, Kaushik Roy

TL;DR

Catastrophic forgetting hampers sequential task learning in fixed networks. The paper introduces Gradient Projection Memory (GPM), which stores low‑dimensional bases of core gradient spaces derived from layer activations via SVD and enforces orthogonal gradient updates for new tasks. This approach preserves past knowledge with competitive accuracy while reducing memory requirements and preserving data privacy. It scales to deep networks and long task sequences, offering a practical alternative to data replay or network growth.

Abstract

The ability to learn continually without forgetting past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance-based weight updates, or replay of old data from memory. In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the direction orthogonal to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single-shot manner and store them in memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimal to no interference with the past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches.
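
The two steps named in the abstract (extracting a basis from activations with SVD, then taking gradient steps orthogonal to it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it considers a single fully connected layer whose input activations are stacked as columns of a matrix, the function names `update_memory` and `project_gradient` are hypothetical, and the energy-based `threshold` cutoff is one plausible way to decide how many basis vectors to keep.

```python
import numpy as np


def update_memory(memory, activations, threshold=0.95):
    """Extend an orthonormal basis with directions from new activations.

    memory:      (d, k) matrix with orthonormal columns, or None (assumed layout).
    activations: (d, n) matrix, one column per sample (assumed layout).
    threshold:   fraction of activation energy the basis should capture
                 (hypothetical parameter name).
    """
    total = np.linalg.norm(activations) ** 2
    residual = activations
    captured = 0.0
    if memory is not None:
        # Remove the part of the activations already spanned by the memory.
        in_memory = memory @ (memory.T @ activations)
        captured = np.linalg.norm(in_memory) ** 2
        residual = activations - in_memory
    if captured >= threshold * total:
        return memory  # existing basis already explains enough energy
    U, S, _ = np.linalg.svd(residual, full_matrices=False)
    energy = captured + np.cumsum(S ** 2)
    # Smallest number of new directions that reaches the energy target.
    k = min(int(np.searchsorted(energy, threshold * total)) + 1, len(S))
    new_basis = U[:, :k]
    return new_basis if memory is None else np.hstack([memory, new_basis])


def project_gradient(grad, memory):
    """Remove the component of a (out, d) weight gradient lying in span(memory)."""
    return grad - (grad @ memory) @ memory.T


# Toy usage: activations with known singular values 3, 2, 1 in an 8-dim space.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
V, _ = np.linalg.qr(rng.standard_normal((50, 3)))
task1_acts = Q[:, :3] @ np.diag([3.0, 2.0, 1.0]) @ V.T  # shape (8, 50)

memory = update_memory(None, task1_acts, threshold=0.9)
grad = rng.standard_normal((4, 8))          # stand-in gradient for a new task
safe_grad = project_gradient(grad, memory)  # orthogonal to the stored basis
```

In this sketch, a weight update along `safe_grad` leaves the layer's output unchanged for any input that lies in the span of `memory`, which is the sense in which the update does not interfere with the stored directions. How closely past-task inputs fall inside that span depends on the chosen threshold.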

Paper Structure

This paper contains 28 sections, 13 equations, 5 figures, 11 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of convolution operation in matrix multiplication format during (a) Forward Pass and (b) Backward Pass.
  • Figure 2: (a) Memory utilization and (b) per-epoch training time for PMNIST tasks for different methods. Memory utilization for different approaches for (c) CIFAR-100, (d) miniImageNet and (e) 5-Datasets tasks. For memory, the size of GPM_Max is used as the reference, and for time, the method with the highest complexity is used as the reference (value of 1). All other methods are reported relative to these references.
  • Figure 3: Histograms of interference activations as a function of the threshold ($\epsilon_{th}$) at (a) Conv layer 2 and (b) FC layer 2 for split CIFAR-100 tasks. (c) Impact of $\epsilon_{th}$ on ACC (%) and BWT (%). As $\epsilon_{th}$ increases, the spread of interference shrinks, which improves accuracy and reduces forgetting.
  • Figure 4: Evolution of task 1 accuracy over the course of incremental learning of 20 sequential tasks from miniImageNet dataset. Learned accuracy in our method remains stable throughout learning.
  • Figure 5: Illustration of how the threshold hyperparameter controls the degree of interference at (a) Conv layer 1, (b) Conv layer 3 and (c) FC layer 1, shown with histogram plots of interference activations from the Split CIFAR-100 experiment. As $\epsilon_{th}$ increases, the spread of the interference activations decreases, which minimizes forgetting.