Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds

Abhinaba Basu

Abstract

A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.

Paper Structure (66 sections, 4 theorems, 8 equations, 9 figures, 13 tables)

Key Result

Theorem 1

For any bounded linear map $f$ and inputs $x, x_c$ with $\|x - x_c\| \leq \delta$:

Figures (9)

  • Figure 1: Overview. (a) Three composable approximation stages, each with machine-checked error bounds. (b) Lyapunov $V(t)$ trajectories across five architectures (all measured via a unified pipeline): GPT-2 Small contracts monotonically ($\rho_{\max} = 0.96$), while GPT-2 Medium's final layer amplifies dramatically ($\rho_{23} = 2.05$). LFM2-2.6B (hybrid) and Mistral-7B show moderate amplification; Qwen3-8B oscillates most. (c) Catastrophe threshold $\tau$ increases with model dimension but is modulated by architecture (Table \ref{tab:threshold_all}).
  • Figure 2: Structural sensitivity analysis. (a) Per-group compression regret on log scale ($12 \times 6$ heatmap). Each cell shows the perplexity increase when that group alone is compressed (sparsity=5%, rank=32). Layer 0 mlp_fc dominates at 459,904. (b) Component-type sensitivity hierarchy: MLP components (red) are consistently more sensitive than attention (blue), spanning a $46\times$ range.
  • Figure 3: The early-layer MLP catastrophe. (a) Layer 0 component decomposition across five compression levels: mlp_fc dominates at every level, reaching 460K regret at aggressive compression. V projections remain nearly free even at extreme settings. (b) mlp_fc regret by layer reveals a sharp phase transition between layers 2 and 3: a $260\times$ cliff separates catastrophic from negligible sensitivity.
  • Figure 4: Cumulative compression: forward (L0$\to$L11) vs. backward (L11$\to$L0). The $25{,}716\times$ gap when only one layer is compressed demonstrates that layer 0 dominates the sensitivity landscape.
  • Figure 5: Contraction decomposition. (a) GPT-2 Small: blue bars show hidden state growth ($\alpha^2$), red bars show error growth ($\rho \cdot \alpha^2$), green diamonds show contraction factor $\rho$. Layer 0's massive expansion ($\alpha^2 = 123$) produces the strongest contraction ($\rho = 0.25$). All layers contract. (b) Mistral-7B across 31 layers. Red-shaded regions highlight amplifying layers ($\rho > 1$).
  • ...and 4 more figures
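
The contraction decomposition in the Figure 5 caption can be illustrated with a minimal sketch. It assumes only the relationship stated there: hidden state growth is $\alpha^2$, error growth is $\rho \cdot \alpha^2$, and a layer amplifies when $\rho > 1$. The function names are hypothetical, and the error-growth value of 30.75 is not reported in the text; it is simply what $\rho = 0.25$ and $\alpha^2 = 123$ imply for layer 0 of GPT-2 Small.

```python
def contraction_factor(hidden_growth: float, error_growth: float) -> float:
    """Return rho, the ratio of error growth to hidden state growth.

    hidden_growth is alpha^2 and error_growth is rho * alpha^2, following
    the Figure 5 caption. Both names are hypothetical.
    """
    if hidden_growth <= 0.0:
        raise ValueError("hidden_growth (alpha^2) must be positive")
    return error_growth / hidden_growth


def classify_layer(rho: float) -> str:
    """Label a layer using the caption's criterion: rho > 1 amplifies."""
    return "amplifying" if rho > 1.0 else "contracting"


# Layer 0 of GPT-2 Small: alpha^2 = 123 and rho = 0.25 come from the caption.
# The error growth 0.25 * 123 = 30.75 is implied, not a reported measurement.
layer0_rho = contraction_factor(hidden_growth=123.0, error_growth=30.75)
print(layer0_rho, classify_layer(layer0_rho))

# Contraction factors quoted in the Figure 1 caption.
print(classify_layer(0.96))  # GPT-2 Small, rho_max
print(classify_layer(2.05))  # GPT-2 Medium, rho_23
```

Read this way, a large $\alpha^2$ does not by itself imply a large error: layer 0 expands the hidden state by two orders of magnitude, yet the error grows only a quarter as fast, so the relative error shrinks.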

Theorems & Definitions (4)

  • Theorem 1: Retrieval Bound
  • Theorem 2: Sparse Activation Bound
  • Theorem 3: Interpolation Bound
  • Theorem 4: Living Inference Composition Bound