Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting

Ran Cheng

TL;DR

A systematic taxonomy of 15+ closed research directions -- including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization -- provides the community with precisely diagnosed negative results.

Abstract

Catastrophic forgetting remains a central challenge in continual learning (CL), yet lacks a unified information-theoretic explanation for why some architectures forget catastrophically while others do not. We introduce \emph{Context Channel Capacity} ($C_\mathrm{ctx}$), the mutual information between a CL architecture's context signal and its generated parameters, and prove that zero forgetting requires $C_\mathrm{ctx} \geq H(T)$, where $H(T)$ is the task identity entropy. We establish an \emph{Impossibility Triangle} -- zero forgetting, online learning, and finite parameters cannot be simultaneously satisfied by sequential state-based learners -- and show that conditional regeneration architectures (HyperNetworks) bypass this triangle by redefining parameters as function values rather than states. We validate this framework across 8 CL methods on Split-MNIST (1,130+ experiments over 86 days, 4 seeds each), showing that $C_\mathrm{ctx}$ perfectly predicts forgetting behavior: methods with $C_\mathrm{ctx} = 0$ (NaiveSGD, EWC, SI, LwF, CFlow) exhibit catastrophic forgetting (6--97\%), while methods with $C_\mathrm{ctx} \approx 1$ (HyperNetwork) achieve zero forgetting (98.8\% ACC). We further propose \emph{Wrong-Context Probing} (P5), a practical diagnostic protocol for measuring $C_\mathrm{ctx}$, and extend the framework to CIFAR-10 via a novel \emph{Gradient Context Encoder} that closes the oracle gap from 23.3pp to 0.7pp. A systematic taxonomy of 15+ closed research directions -- including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization -- provides the community with precisely diagnosed negative results. Our central design principle: \emph{architecture over algorithm} -- the context pathway must be structurally unbypassable.
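To make the condition $C_\mathrm{ctx} \geq H(T)$ concrete, the following minimal sketch computes the task identity entropy for a discrete task distribution and checks a given capacity against it. The function names, the five-task uniform distribution, and the choice to express both quantities in bits are illustrative assumptions rather than details taken from the paper; in particular, the paper's reported $C_\mathrm{ctx}$ values may use a different normalization.

```python
import math


def task_entropy_bits(task_probs):
    """Shannon entropy H(T), in bits, of a task-identity distribution."""
    if abs(sum(task_probs) - 1.0) > 1e-9:
        raise ValueError("task probabilities must sum to 1")
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in task_probs if p > 0)


def satisfies_capacity_condition(c_ctx_bits, task_probs):
    """Check the necessary condition C_ctx >= H(T) stated in the abstract."""
    return c_ctx_bits >= task_entropy_bits(task_probs)


# Hypothetical setting: five equally likely tasks (an assumption for illustration).
uniform_five = [0.2] * 5
h_t = task_entropy_bits(uniform_five)  # equals log2(5) for a uniform distribution

# A context pathway carrying no task information cannot meet the condition.
dead_pathway_ok = satisfies_capacity_condition(0.0, uniform_five)
```

Under these assumptions, a dead context pathway ($C_\mathrm{ctx} = 0$) fails the check for any task distribution with more than one possible task, which matches the abstract's grouping of the $C_\mathrm{ctx} = 0$ methods.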

Paper Structure (86 sections, 14 theorems, 83 equations, 3 figures, 12 tables)

Key Result

Theorem 1

For a deterministic sequential learner with finite parameter capacity $C = d \cdot \log_2(1/\delta)$ bits and independent tasks $\{D_k\}_{k=1}^K$, the mutual information between the final parameters and any past task dataset satisfies:

Figures (3)

  • Figure 1: DND neuron overlap and template similarity analysis. No emergent task specialization is observed: neurons are shared across tasks, and templates fail to differentiate.
  • Figure 2: Evidence for the $S_N$ symmetry barrier. (a) HSPC-T fails to produce column specialization. (b) Metabolic pruning's $\chi$ statistic is uncorrelated with column importance.
  • Figure 3: CFlow probing results. P5 (wrong context) causes zero accuracy change, while P6 (random $\theta_0$) causes a 40pp accuracy drop. This confirms that CFlow is a "$\theta_0$ memorizer": the context pathway is structurally dead.
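
The wrong-context probe described for Figure 3 can be illustrated with a toy sketch: evaluate a model on one task with its correct context, then again with another task's context, and report the accuracy difference. The two toy predictors, the data, and the function names below are hypothetical stand-ins, not the paper's models or its exact P5 protocol.

```python
def accuracy(predict, inputs, labels, context):
    """Fraction of inputs classified correctly under the given context."""
    correct = sum(predict(x, context) == y for x, y in zip(inputs, labels))
    return correct / len(inputs)


def wrong_context_drop(predict, inputs, labels, true_context, wrong_context):
    """Accuracy with the correct context minus accuracy with a wrong one."""
    return (accuracy(predict, inputs, labels, true_context)
            - accuracy(predict, inputs, labels, wrong_context))


def context_aware(x, context):
    # Toy model whose decision rule depends on the context signal.
    return int(x > 0) if context == 0 else int(x <= 0)


def context_dead(x, context):
    # Toy model that ignores the context entirely.
    return int(x > 0)


# Hypothetical task-0 data: the label is 1 exactly when the input is positive.
inputs = [-2.0, -1.0, 1.0, 2.0]
labels_task0 = [0, 0, 1, 1]

drop_aware = wrong_context_drop(context_aware, inputs, labels_task0, 0, 1)
drop_dead = wrong_context_drop(context_dead, inputs, labels_task0, 0, 1)
```

In this toy setting the context-ignoring model shows no accuracy change under the wrong context, the same signature the caption reports for CFlow, while the context-dependent model's accuracy collapses.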

Theorems & Definitions (37)

  • Definition 1: Continual Learning Problem
  • Definition 2: Forgetting
  • Definition 3: Parameter Capacity
  • Theorem 1: Forgetting is Information-Theoretically Inevitable
  • proof
  • Corollary 2: Forgetting Lower Bound
  • proof
  • Theorem 3: Continual Learning Impossibility Triangle
  • proof
  • Remark 1: Sharpness of the triangle
  • ...and 27 more
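
The parameter capacity used in Theorem 1, $C = d \cdot \log_2(1/\delta)$ bits, can be evaluated directly. The sketch below assumes $d$ is the number of parameters and $\delta$ is a per-parameter resolution; that reading, along with the example values, is an assumption for illustration and is not confirmed by this page.

```python
import math


def parameter_capacity_bits(d, delta):
    """Capacity C = d * log2(1 / delta) in bits.

    d:     number of parameters (assumed interpretation)
    delta: per-parameter resolution in (0, 1] (assumed interpretation)
    """
    if d < 0 or not (0 < delta <= 1):
        raise ValueError("need d >= 0 and 0 < delta <= 1")
    return d * math.log2(1.0 / delta)


# Hypothetical example: 1000 parameters, each resolved to 1 part in 256.
capacity = parameter_capacity_bits(1000, 2 ** -8)
```

With these example values each parameter contributes 8 bits, so the capacity grows linearly in $d$ but only logarithmically as the resolution $\delta$ is refined.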