Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting

Ran Cheng

TL;DR

A systematic taxonomy of 15+ closed research directions -- including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization -- provides the community with precisely diagnosed negative results.

Abstract

Catastrophic forgetting remains a central challenge in continual learning (CL), yet lacks a unified information-theoretic explanation for why some architectures forget catastrophically while others do not. We introduce \emph{Context Channel Capacity} ($C_\mathrm{ctx}$), the mutual information between a CL architecture's context signal and its generated parameters, and prove that zero forgetting requires $C_\mathrm{ctx} \geq H(T)$, where $H(T)$ is the task identity entropy. We establish an \emph{Impossibility Triangle} -- zero forgetting, online learning, and finite parameters cannot be simultaneously satisfied by sequential state-based learners -- and show that conditional regeneration architectures (HyperNetworks) bypass this triangle by redefining parameters as function values rather than states. We validate this framework across 8 CL methods on Split-MNIST (1,130+ experiments over 86 days, 4 seeds each), showing that $C_\mathrm{ctx}$ perfectly predicts forgetting behavior: methods with $C_\mathrm{ctx} = 0$ (NaiveSGD, EWC, SI, LwF, CFlow) exhibit catastrophic forgetting (6--97\%), while methods with $C_\mathrm{ctx} \approx 1$ (HyperNetwork) achieve zero forgetting (98.8\% ACC). We further propose \emph{Wrong-Context Probing} (P5), a practical diagnostic protocol for measuring $C_\mathrm{ctx}$, and extend the framework to CIFAR-10 via a novel \emph{Gradient Context Encoder} that closes the oracle gap from 23.3pp to 0.7pp. A systematic taxonomy of 15+ closed research directions -- including the Hebbian null result (frozen random features outperform learned features), CFlow's $\theta_0$-memorizer phenomenon, and the $S_N$ symmetry barrier to column specialization -- provides the community with precisely diagnosed negative results. Our central design principle: \emph{architecture over algorithm} -- the context pathway must be structurally unbypassable.
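The condition $C_\mathrm{ctx} \geq H(T)$ can be made concrete with a small numerical sketch. The code below is illustrative only and is not the paper's measurement procedure: it assumes a hypothetical setting with $K = 4$ equally likely tasks, summarizes the context-to-parameter relationship as a discrete joint probability table, and uses a plug-in mutual-information estimate in unnormalized bits. The names `entropy_bits`, `mutual_information_bits`, `perfect`, and `dead` are introduced here for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information_bits(joint):
    """Plug-in mutual information I(X; Y) in bits from a joint probability table."""
    row = [sum(r) for r in joint]          # marginal over contexts
    col = [sum(c) for c in zip(*joint)]    # marginal over parameter clusters
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0:
                mi += p * math.log2(p / (row[i] * col[j]))
    return mi

# Assumption: K = 4 equally likely tasks (hypothetical, not the paper's setup).
K = 4
task_entropy = entropy_bits([1 / K] * K)

# Hypothetical joint tables over (context, parameter cluster):
# "perfect" - each context maps to its own parameter cluster;
# "dead"    - generated parameters are independent of the context.
perfect = [[(1 / K if i == j else 0.0) for j in range(K)] for i in range(K)]
dead = [[1 / (K * K)] * K for _ in range(K)]

mi_perfect = mutual_information_bits(perfect)
mi_dead = mutual_information_bits(dead)

print(f"H(T) = {task_entropy:.2f} bits")
print(f"perfect channel: I = {mi_perfect:.2f} bits, meets bound: {mi_perfect >= task_entropy}")
print(f"dead channel:    I = {mi_dead:.2f} bits, meets bound: {mi_dead >= task_entropy}")
```

In this toy setting the context-independent table carries zero bits about the task and so cannot satisfy the bound, while the one-to-one table carries exactly $H(T)$ bits.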

Paper Structure (86 sections, 14 theorems, 83 equations, 3 figures, 12 tables)

Introduction
The missing explanation.
Our answer: Context Channel Capacity.
Contributions.
Broader contribution: systematic negative results.
One-sentence summary.
Theoretical Framework
Continual Learning as Constrained Online Coding
Connection to rate-distortion theory.
The Information Bottleneck Chain
Relation to prior work.
The Impossibility Triangle
Context Channel Capacity
Intuition.
Paradigm Taxonomy via $C_\mathrm{ctx}$
...and 71 more sections

Key Result

Theorem 1

For a deterministic sequential learner with finite parameter capacity $C = d \cdot \log_2(1/\delta)$ bits and independent tasks $\{D_k\}_{k=1}^K$, the mutual information between the final parameters and any past task dataset satisfies:
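As a worked illustration of the capacity term $C = d \cdot \log_2(1/\delta)$ only (the bound itself is not part of this sketch), the snippet below evaluates $C$ for one hypothetical configuration. The values $d = 10^6$ and $\delta = 2^{-16}$ are assumptions chosen for easy arithmetic, and reading $\delta$ as a per-parameter resolution is likewise an assumption; the function name `parameter_capacity_bits` is introduced here.

```python
import math

def parameter_capacity_bits(d, delta):
    """Capacity C = d * log2(1 / delta) in bits, as in the theorem statement."""
    return d * math.log2(1.0 / delta)

# Assumed values for illustration: one million parameters, resolution 2^-16.
d = 1_000_000
delta = 2.0 ** -16

capacity = parameter_capacity_bits(d, delta)
print(f"C = {capacity:.0f} bits ({capacity / 8 / 1e6:.1f} MB)")
```

Under these assumed values each parameter contributes 16 bits, giving $C = 1.6 \times 10^7$ bits, a fixed budget that does not grow with the number of tasks $K$.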

Figures (3)

Figure 1: DND neuron overlap and template similarity analysis. No emergent task specialization is observed: neurons are shared across tasks, and templates fail to differentiate.
Figure 2: Evidence for the $S_N$ symmetry barrier. (a) HSPC-T fails to produce column specialization. (b) Metabolic pruning's $\chi$ statistic is uncorrelated with column importance.
Figure 3: CFlow probing results. P5 (wrong context) causes zero accuracy change, while P6 (random $\theta_0$) causes a $-40$pp drop. This confirms CFlow is a "$\theta_0$ memorizer"---the context pathway is structurally dead.
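The logic of the P5 probe in Figure 3 can be sketched on a toy problem: evaluate a model with the correct context, then with a swapped context, and compare accuracy. Everything below is a hypothetical illustration rather than the paper's protocol or models: `context_sensitive` and `context_blind` are toy predictors over single bits, and `wrong_context_drop` is a name introduced here.

```python
def accuracy(predict, data, context):
    """Fraction of (x, y) pairs that predict(x, context) labels correctly."""
    return sum(predict(x, context) == y for x, y in data) / len(data)

def wrong_context_drop(predict, data, true_context, wrong_context):
    """Accuracy with the true context minus accuracy with a swapped context."""
    return accuracy(predict, data, true_context) - accuracy(predict, data, wrong_context)

# Hypothetical toy task 0: the label equals the input bit.
task0 = [(0, 0), (1, 1), (0, 0), (1, 1)]

def context_sensitive(x, context):
    # Toy model whose output depends on the context bit.
    return x ^ context

def context_blind(x, context):
    # Toy model that ignores the context entirely.
    return x

drop_sensitive = wrong_context_drop(context_sensitive, task0, true_context=0, wrong_context=1)
drop_blind = wrong_context_drop(context_blind, task0, true_context=0, wrong_context=1)

print(f"context-sensitive model: accuracy drop = {drop_sensitive:.2f}")
print(f"context-blind model:     accuracy drop = {drop_blind:.2f}")
```

A zero drop under the wrong context, as with the context-blind toy model, is the signature the figure reports for CFlow: the prediction does not route through the context pathway.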

Theorems & Definitions (37)

Definition 1: Continual Learning Problem
Definition 2: Forgetting
Definition 3: Parameter Capacity
Theorem 1: Forgetting is Information-Theoretically Inevitable
proof
Corollary 2: Forgetting Lower Bound
proof
Theorem 3: Continual Learning Impossibility Triangle
proof
Remark 1: Sharpness of the triangle
...and 27 more
