Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

Nilesh Sarkar, Dawar Jyoti Deka

Abstract

Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = 1/((1-\alpha)\ln\frac{1}{1-\alpha})$ is a sparsity-dependent capacity function. Features beyond this budget are permanently lost, yielding an importance-weighted loss floor. We validate on a toy model (48 configurations, median accuracy >93%) and on Pythia-410M, where sparse autoencoders measure $F \approx 28{,}700$ features at $\alpha \approx 0.992$ (critical width $d_S^* \approx 1{,}065$). Distillation into five student widths confirms the predicted monotonic floor ordering. The observed floor decomposes into a geometric component and a width-independent architectural baseline ($R^2 = 0.993$). Linear probing shows coarse concepts survive even 88% feature loss, revealing the floor arises from aggregate loss of fine-grained features in the importance distribution's long tail. Our results connect representation geometry to distillation limits and provide a practical tool for predicting distillation performance from SAE measurements alone.
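As a quick numerical check (ours, not the paper's code), the capacity function and critical width follow directly from the two reported SAE measurements. Note that $\alpha$ is reported only to three decimals, so the computed $d_S^*$ lands near, rather than exactly at, the paper's $d_S^* \approx 1{,}065$:

```python
import math

# Sketch: evaluate g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha)))
# and the critical width d_S* = F / g(alpha) at the reported Pythia-410M values.
F = 28_700       # teacher features measured by sparse autoencoders
alpha = 0.992    # feature sparsity (reported to three decimals)

g = 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))
d_s_star = F / g

print(f"g(alpha) = {g:.1f} features per dimension")   # ~25.9
print(f"critical width d_S* = {d_s_star:.0f}")        # ~1,109 (paper: ~1,065)
```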

Paper Structure

This paper contains 40 sections, 1 theorem, 4 equations, 18 figures, and 8 tables.

Key Result

Theorem 1

Under assumptions: (A1) the teacher's features are sparse with sparsity $\alpha$; (A2) the student allocates capacity optimally by importance (Scherlis et al., 2022); (A3) the student's hidden layer acts as the primary information bottleneck. Then for any student with width $d_\mathrm{S}$, define the critical width $d_\mathrm{S}^* = F / g(\alpha)$, where $F$ is the number of teacher features and $g(\alpha) = \frac{1}{(1-\alpha)\ln\frac{1}{1-\alpha}}$ is the capacity function. The student can encode at most $d_\mathrm{S} \cdot g(\alpha)$ features, so whenever $d_\mathrm{S} < d_\mathrm{S}^*$ some features must go unrepresented, and the distillation loss is bounded below by the aggregate importance of those lost features (the importance-weighted loss floor).
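A minimal sketch of how this bound could serve as the width-planning tool mentioned in the abstract, assuming only SAE measurements $(F, \alpha)$ are available; the function name and interface are ours, not the paper's:

```python
import math

def critical_width(num_features: int, sparsity: float) -> float:
    """Smallest student width that can, in principle, encode all teacher features.

    Implements d_S* = F / g(alpha), with the capacity function
    g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha))).
    Theorem 1 predicts a nonzero geometric loss floor for any narrower student.
    """
    g = 1.0 / ((1.0 - sparsity) * math.log(1.0 / (1.0 - sparsity)))
    return num_features / g

# Example with the Pythia-410M measurements (illustrative widths, not the
# paper's five student widths):
for width in (256, 512, 1024, 2048):
    gap = critical_width(28_700, 0.992) - width
    print(f"width {width:4d}: {'below' if gap > 0 else 'at/above'} critical width")
```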

Figures (18)

  • Figure 1: Capacity function and critical width. Left: $g(\alpha)$ grows exponentially with sparsity; colored dots mark our toy model sparsity levels. Right: Critical width $d_\mathrm{S}^* = F/g(\alpha)$ shrinks as sparsity increases, since sparser features need fewer dimensions.
  • Figure 2: Loss floor vs. student width across 48 configurations (rows: $n$; columns: $\alpha$). Solid = actual (mean $\pm$ std, 20 seeds); dashed = formula (Eq. \ref{eq:floor}); dotted = $d_\mathrm{S}^*$. The formula captures both magnitude and shape.
  • Figure 3: Predicted vs. actual floor (log-log, 140 points). Left: Refined formula with $g(\alpha)$ (Pearson $r = 0.93$, MAPE $= 93.9\%$). Right: Naive formula assuming one feature/dim ($R^2 = -0.04$). Color = sparsity.
  • Figure 4: SAE training convergence. Layer 8 (blue) encodes a denser feature set; deeper layers 12 (orange) and 16 (green) show sparser, more selective representations with more feature death.
  • Figure 5: (a) Feature importance follows a power law: the top ${\sim}20$ features dominate, with a cliff near rank ${\sim}3{,}000$ where thousands reach ${\sim}10^{-7}$. This heavy tail is why compression works. (b) Predicted floor vs. width at layers 8, 12, 16. All layers agree ($d_\mathrm{S}^* \in [1065, 1186]$), converging near zero at $d_\mathrm{S} = 1024$. A toy numeric reconstruction of this floor mechanism appears after the figure list.
  • ...and 13 more figures
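To make the floor mechanism of Figure 5 concrete, here is a toy reconstruction under an assumed power-law importance distribution. The exponent $\beta$, the normalization, and the widths are our assumptions for illustration; the paper reports only "a power law" with a cliff near rank ${\sim}3{,}000$:

```python
import math

# Assumed importance distribution: I_i proportional to i^(-beta); beta = 1.5 is hypothetical.
F, alpha, beta = 28_700, 0.992, 1.5
g = 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))

importance = [r ** -beta for r in range(1, F + 1)]
total = sum(importance)

def geometric_floor(width: int) -> float:
    """Aggregate normalized importance of features beyond the budget width * g(alpha).

    This is only the geometric component; the observed floor would add the
    width-independent architectural baseline on top.
    """
    budget = int(width * g)
    return sum(importance[budget:]) / total if budget < F else 0.0

# Predicted monotonic ordering: wider students lose less of the importance tail.
for width in (128, 256, 512, 768, 1024):
    print(f"width {width:4d}: geometric floor = {geometric_floor(width):.4f}")
```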

Theorems & Definitions (2)

  • Theorem 1: Distillation minimum-width bound
  • Proof of Theorem 1: Proof sketch