Gradient Correlation Subspace Learning against Catastrophic Forgetting

Tammuz Dubnov, Vishal Thengane

TL;DR

Gradient Correlation Subspace Learning (GCSL) addresses catastrophic forgetting in incremental class learning by freezing the weights learned on prior tasks and learning each new task within a per-layer subspace defined by the gradient correlation structure. For each layer $l$, GCSL computes a gradient-correlation matrix $C_g^l$ from $c_g^l = (dL/dh^l)^T(dL/dh^l)$, extracts the eigenvectors corresponding to the smallest eigenvalues to form a projection $V$, and updates the weights via $W^l_{t_k} = W^l_{t_{k-1}} + V^l_{t_{k-1}} W^l_{trainable}$ while keeping $W^l_{t_{k-1}}$ frozen. A task-specific BCE loss $loss(i,y) = -[y \log(i) + (1-y)\log(1-i)]$ confines learning to the current task, preventing interference with the logits of previous tasks. Experiments on MNIST and Fashion MNIST show that GCSL can substantially mitigate forgetting; performance depends on the subspace size and on which layer is targeted, and results are competitive with Gradient Projection Memory (GPM) in several settings. The approach offers a practical, configurable alternative for continual learning that can be used with standard optimizers and could be combined with replay or contrastive strategies in future work.
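
The per-layer procedure summarized above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the shape convention ($h^l = W^l x$ with $W^l$ of shape `(d_out, d_in)`), the accumulation of $c_g^l$ by summing over samples, and the synthetic gradients are all hypothetical choices made for the example.

```python
import numpy as np

def gradient_correlation(grads):
    """Accumulate (dL/dh)^T (dL/dh) over samples (summing is an assumption).

    grads: (n_samples, d_out) array, one row of dL/dh^l per sample.
    Returns a symmetric (d_out, d_out) matrix.
    """
    return grads.T @ grads

def smallest_eigvec_subspace(corr, k):
    """Return a (d_out, k) matrix whose columns are the eigenvectors of
    `corr` belonging to its k smallest eigenvalues."""
    # eigh is for symmetric matrices and returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(corr)
    return eigvecs[:, :k]

def effective_weight(w_frozen, v, w_trainable):
    """W_{t_k} = W_{t_{k-1}} + V W_trainable, with W_{t_{k-1}} kept frozen."""
    return w_frozen + v @ w_trainable

# Hypothetical toy setup: a layer with 6 outputs and 4 inputs, subspace size 2.
rng = np.random.default_rng(0)
d_out, d_in, k = 6, 4, 2

# Synthetic "previous task" gradients that occupy only 3 of the 6 directions.
basis = rng.standard_normal((3, d_out))
grads = rng.standard_normal((100, 3)) @ basis

C = gradient_correlation(grads)
V = smallest_eigvec_subspace(C, k)

W_frozen = rng.standard_normal((d_out, d_in))
W_trainable = rng.standard_normal((k, d_in))  # the only parameters that would be trained
W_new = effective_weight(W_frozen, V, W_trainable)

# Change in the layer output for an arbitrary input, and its first-order
# effect on the previous task's loss (inner product with the old gradients).
x = rng.standard_normal(d_in)
delta_h = (W_new - W_frozen) @ x
first_order = grads @ delta_h
print("max |first-order change|:", np.abs(first_order).max())
```

In this toy case the old gradients have no component along the selected eigenvectors, so any value of `W_trainable` leaves the previous task's loss unchanged to first order. With real gradients the smallest eigenvalues are generally nonzero, and the subspace size `k` trades capacity for the new task against interference with earlier ones.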

Abstract

Efficient continual learning techniques have been a topic of significant research over the last few years. A fundamental problem with such learning is severe degradation of performance on previously learned tasks, also known as catastrophic forgetting. This paper introduces a novel method for reducing catastrophic forgetting in the context of incremental class learning, called Gradient Correlation Subspace Learning (GCSL). The method detects a subspace of the weights that is least affected by previous tasks and projects the weights to be trained for the new task into that subspace. The method can be applied to one or more layers of a given network architecture, and the size of the subspace used can be varied from layer to layer and from task to task. Code will be available at https://github.com/vgthengane/GCSL.

Paper Structure

This paper contains 30 sections, 5 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: The three continual learning scenarios for the first two tasks of the MNIST dataset. The top shows the first task and the bottom shows the task following it. The pattern in the final logit layer for each learning type is extended for $T_{j>(i+1)}$ tasks (Hsu et al., 2019).
  • Figure 2: Results for different subspace-size configurations applied to all layers on the MNIST dataset. Accuracy is measured on the cumulative validation set at the end of training on the final task. The $Baseline$ serves as an upper bound on the performance achievable with the given network size, and ${L1:0, L2:0}$ serves as a lower bound, corresponding to incremental learning without the GCSL technique. Note that ${L1:20, L2:20}$ is the largest possible gradient subspace configuration.
  • Figure 3: Results for different layer configurations on the MNIST dataset, using the subspace size with the best performance in Figure 2. Here ${L1:10, L2:10}$ means the GCSL method was applied to both layers, ${L1:10, L2:0}$ means it was applied to the first layer only, and ${L1:0, L2:10}$ means it was applied to the second layer only.
  • Figure 4: Results for different subspace-size configurations applied to all layers on the Fashion MNIST dataset. Accuracy is measured on the cumulative validation set at the end of training on the final task. The $Baseline$ serves as an upper bound on the performance achievable with the given network size, and ${L1:0, L2:0}$ serves as a lower bound, corresponding to incremental learning without the GCSL technique. Note that ${L1:40, L2:20}$ is the largest possible gradient subspace configuration.
  • Figure 5: Results for different layer configurations on the Fashion MNIST dataset, using the subspace size with the best performance in Figure 4. Here ${L1:20, L2:10}$ means the GCSL method was applied to both layers, ${L1:20, L2:0}$ means it was applied to the first layer only, and ${L1:0, L2:10}$ means it was applied to the second layer only.