Mitigating Forgetting in Continual Learning with Selective Gradient Projection

Anika Singh, Aayush Dhaulakhandi, Varun Chopade, Likhith Malipati, David Martinez, Kevin Zhu

Abstract

As neural networks are increasingly deployed in dynamic environments, they face catastrophic forgetting: the tendency to overwrite previously learned knowledge when adapting to new tasks, resulting in severe performance degradation on earlier tasks. We propose Selective Forgetting-Aware Optimization (SFAO), a dynamic method that regulates gradient directions via cosine similarity and per-layer gating, enabling controlled forgetting while balancing plasticity and stability. SFAO selectively projects, accepts, or discards updates using a tunable mechanism with an efficient Monte Carlo approximation. Experiments on standard continual learning benchmarks show that SFAO achieves competitive accuracy at markedly lower memory cost (a 90% reduction) and reduced forgetting on the MNIST benchmarks, making it suitable for resource-constrained scenarios.
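
To make the mechanism concrete, the following is a minimal PyTorch sketch of the accept / project / discard rule described above, applied to a single flattened gradient. The function name and the thresholds (tau_accept, tau_discard) are illustrative assumptions, not the paper's implementation; the per-layer gating and the Monte Carlo approximation are omitted.

import torch

def sfao_update(g, buffer_grads, tau_accept=0.0, tau_discard=-0.9):
    """Gate a flattened gradient g against buffered past-task gradients.

    Hypothetical thresholds: updates aligned with all past gradients are
    accepted as is, severely conflicting ones are discarded, and the rest
    are projected off the subspace spanned by the buffer.
    """
    B = torch.stack(buffer_grads)                       # (k, d) past-task gradients
    cos = torch.nn.functional.cosine_similarity(
        g.unsqueeze(0), B, dim=1)                       # similarity to each past task
    if cos.min() >= tau_accept:                         # no conflict: accept as is
        return g
    if cos.min() < tau_discard:                         # severe conflict: discard
        return torch.zeros_like(g)
    # Otherwise project: u = (I - P_S) g, with P_S built from an
    # orthonormal basis Q of the buffered gradients.
    Q, _ = torch.linalg.qr(B.T)                         # columns of Q span S
    return g - Q @ (Q.T @ g)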

Paper Structure

This paper contains 55 sections, 1 theorem, 30 equations, 2 figures, 11 tables, and 1 algorithm.

Key Result

Proposition 2.1

If $u = (I - P_\mathcal{S})\,g_t$, then $g^\top u = 0$ for all $g\in\mathcal{S}$, and thus for any past task $i$ whose gradient $g_i\in\mathcal{S}$ we have $\Delta \mathcal{L}_i = O(\eta^2)$. Hence orthogonal projection removes first-order forgetting on tasks whose gradients are represented in $\mathcal{S}$.
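
The $O(\eta^2)$ bound is the standard first-order Taylor argument, spelled out here for completeness: with step size $\eta$ and $g_i = \nabla_\theta \mathcal{L}_i(\theta)$,

$$\mathcal{L}_i(\theta - \eta u) = \mathcal{L}_i(\theta) - \eta\, g_i^\top u + O(\eta^2) = \mathcal{L}_i(\theta) + O(\eta^2),$$

so $\Delta \mathcal{L}_i = O(\eta^2)$ whenever $g_i \in \mathcal{S}$, since the first-order term vanishes.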

Figures (2)

  • Figure 1: Forgetting curve per baseline on Split MNIST. Forgetting is averaged across previously seen tasks after each new task. There are a total of four tasks.
  • Figure 2: Geometry of the SFAO update. Green ($U_{\text{accept}}$): when the current gradient is sufficiently similar to the buffer $\mathcal{B}$, the update is accepted as is. Blue ($U_{\text{project}}$): otherwise the gradient is orthogonally projected off the subspace spanned by the buffered past gradients $\{g_i\}$ to mitigate interference.
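
As a numeric illustration of the blue "project" region in Figure 2 (hypothetical vectors, reusing the sfao_update sketch above): a gradient with mildly negative cosine similarity to the buffer is projected, and the resulting update is orthogonal to the buffered gradient, so it causes no first-order interference.

import torch

g  = torch.tensor([1.0, 0.0, 0.0])    # current-task gradient
g1 = torch.tensor([-0.5, 1.0, 0.0])   # buffered past-task gradient (cos ~ -0.45)
u  = sfao_update(g, [g1])             # falls in the "project" region
print(torch.dot(u, g1))               # ~0: no first-order interference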

Theorems & Definitions (1)

  • Proposition 2.1: First-order safety for represented tasks