Asymptotic analysis of shallow and deep forgetting in replay with Neural Collapse

Giulia Lanzillotta, Damiano Meier, Thomas Hofmann

TL;DR

This work analyzes why replay buffers in continual learning preserve internal feature geometry far better than they preserve alignment between the learned head and the population distribution. By extending Neural Collapse to sequential training, it shows that small buffers robustly anchor feature-space geometry (preventing deep forgetting) while shallow forgetting requires much larger buffers, with the gap depending on head architecture (single-head vs multi-head) and the task regime. Introducing an OOD-centered perspective, the authors derive a mixture model for replay that interpolates between NC-like and OOD representations, and prove that any non-zero replay preserves asymptotic separability in the NC subspace. The findings reveal a fundamental replay efficiency gap and suggest that correcting NC-induced statistical artifacts could enable robust performance with minimal replay, informing buffer sizing and training regimens for practical continual-learning systems.

Abstract

A persistent paradox in continual learning (CL) is that neural networks often retain linearly separable representations of past tasks even when their output predictions fail. We formalize this distinction as the gap between deep feature-space and shallow classifier-level forgetting. We reveal a critical asymmetry in Experience Replay: while minimal buffers successfully anchor feature geometry and prevent deep forgetting, mitigating shallow forgetting typically requires substantially larger buffer capacities. To explain this, we extend the Neural Collapse framework to the sequential setting. We characterize deep forgetting as a geometric drift toward out-of-distribution subspaces and prove that any non-zero replay fraction asymptotically guarantees the retention of linear separability. Conversely, we identify that the "strong collapse" induced by small buffers leads to rank-deficient covariances and inflated class means, effectively blinding the classifier to true population boundaries. By unifying CL with out-of-distribution detection, our work challenges the prevailing reliance on large buffers, suggesting that explicitly correcting these statistical artifacts could unlock robust performance with minimal replay.
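The rank-deficiency claim in the abstract can be illustrated with a small numerical sketch. It assumes synthetic Gaussian features in place of real network activations; the function name `buffer_covariance_rank` and the sizes used are hypothetical and are not taken from the paper. It relies only on the general fact that a sample covariance estimated from n points has rank at most n - 1, so a buffer holding fewer samples than the feature dimension cannot yield a full-rank estimate.

```python
import numpy as np


def buffer_covariance_rank(n_buffer: int, dim: int, seed: int = 0) -> int:
    """Rank of the sample covariance estimated from n_buffer feature vectors.

    Assumption: features are drawn i.i.d. from a standard Gaussian, which is
    only a stand-in for the replayed features of one class.
    """
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(n_buffer, dim))   # (n_buffer, dim) feature matrix
    cov = np.cov(feats, rowvar=False)          # (dim, dim) sample covariance
    return int(np.linalg.matrix_rank(cov))


dim = 32
small = buffer_covariance_rank(n_buffer=5, dim=dim)    # far fewer samples than dims
large = buffer_covariance_rank(n_buffer=500, dim=dim)  # many more samples than dims

print(f"rank with   5 samples: {small} (at most 4)")
print(f"rank with 500 samples: {large} (feature dimension is {dim})")
```

With five stored samples the estimated covariance spans at most four directions of a 32-dimensional feature space, so any classifier statistic built from it carries no information about the remaining directions.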

Paper Structure

This paper contains 61 sections, 18 theorems, 92 equations, 25 figures, 4 tables.

Key Result

Theorem 1

Let $X_c$ be OOD inputs (def:OOD-main) for a feature map $\phi_t$ trained with a sufficiently small learning rate $\eta$ and weight decay $\lambda$. Let $\beta_t$ denote the observed centered class-mean norm as defined in def:NC2. In the terminal phase ($t \ge t_0$), the feature distribution of $X_c$ has me

Figures (25)

  • Figure 1: Evolution of decision boundaries and feature separability. PCA evolution of two CIFAR-10 classes (1% replay). Replay samples are highlighted with a black edge. While features retain separability across tasks (low deep forgetting), the classifier optimization becomes under-determined: multiple "buffer-optimal" boundaries (dashed brown) perfectly classify the stored samples but largely fail to align with the true population boundary (dashed green), resulting in shallow forgetting.
  • Figure 2: Replay efficiency gap. Forgetting decays at different rates in the feature space and the classifier head, producing a persistent gap between deep and shallow forgetting. Increasing the replay buffer closes this gap only gradually, with substantial buffer sizes required for convergence. See \ref{apsec:fig-details} for details.
  • Figure 3: NC metrics in sequential training (CIFAR-100, ResNet with 5% replay). NC emerges across all tasks. In DIL, the ETF structure ($\mathcal{NC}2$) remains stable; in CIL, it evolves as class count increases; in TIL, it arises per-head with variable cross-task alignment. Highlighted in green is the asymptotic limit of the NC metrics. See \ref{apsec:fig-details} for details.
  • Figure 4: Projection of $\tilde{\mu}_c(t)$ onto $S_t$ (CIFAR-100, no replay). The population means of past and future tasks exhibit equivalent (near-zero) norms when projected onto the active subspace $S_t$.
  • Figure 5: Empirical validation of the theoretical model of feature-space structure (CIFAR-100, ResNet with 5% replay). The plot shows four metrics, each averaged over all past tasks after training on the last task. Results are shown for different buffer sizes and weight decay parameters (different lines). See \ref{apsec:fig-details} for details.
  • ...and 20 more figures
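The ETF structure ($\mathcal{NC}2$) tracked in Figure 3 can be made concrete with a short sketch. It uses the standard simplex equiangular tight frame from the Neural Collapse literature, where K centered class means have equal norm and pairwise cosine $-1/(K-1)$. The helper names `simplex_etf` and `pairwise_cosines` are hypothetical, and the exact metric plotted in the paper is not reproduced here.

```python
import numpy as np


def simplex_etf(num_classes: int) -> np.ndarray:
    """Rows form a simplex ETF: unit norm, pairwise cosine -1/(K-1)."""
    k = num_classes
    return np.sqrt(k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)


def pairwise_cosines(class_means: np.ndarray) -> np.ndarray:
    """Off-diagonal cosines between globally centered class means."""
    centered = class_means - class_means.mean(axis=0, keepdims=True)
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    cos = unit @ unit.T
    off_diagonal = ~np.eye(len(class_means), dtype=bool)
    return cos[off_diagonal]


K = 10
cosines = pairwise_cosines(simplex_etf(K))
target = -1.0 / (K - 1)

# A simple deviation score: zero when the means form a perfect simplex ETF.
etf_deviation = float(np.abs(cosines - target).max())
print(f"target cosine: {target:.4f}, max deviation: {etf_deviation:.2e}")
```

Applied to class means measured from a trained network, the same deviation score would shrink toward zero as the ETF structure emerges.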

Theorems & Definitions (42)

  • Definition 1: Out-of-distribution (OOD)
  • Theorem 1: Asymptotic distribution of OOD data
  • Corollary 1: Collapse to null distribution
  • Theorem 2: Lower bound on OOD linear separability
  • Theorem 3: Lower bound on separability with replay
  • Corollary 2
  • Definition 2: Linear Separability
  • Definition 3: Mahalanobis Distance
  • Lemma 1: Lower Bound to Mahalanobis Distance
  • proof
  • ...and 32 more
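Definition 3 names the Mahalanobis distance, but its statement is not reproduced in this summary. The sketch below therefore uses the textbook form $\sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}$ as an assumption. The function name `mahalanobis` is hypothetical, and substituting a pseudo-inverse for $\Sigma^{-1}$ is an illustrative choice for handling singular covariances, not something taken from the paper.

```python
import numpy as np


def mahalanobis(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """Textbook Mahalanobis distance of x from a distribution (mean, cov).

    Assumption: the pseudo-inverse stands in for the inverse so the function
    still returns a value when cov is rank-deficient.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    precision = np.linalg.pinv(np.asarray(cov, dtype=float))
    return float(np.sqrt(diff @ precision @ diff))


# With identity covariance the distance reduces to the Euclidean distance.
d_identity = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))

# A larger variance along the first axis shrinks distances along that axis.
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], np.diag([4.0, 1.0]))

print(d_identity, d_scaled)
```

The pseudo-inverse ignores directions in which the estimated covariance has zero variance, so a distance computed from a rank-deficient estimate measures separation only within the subspace spanned by the samples used to build that estimate.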