Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds

Abhinaba Basu

Abstract

A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B, which degrades only 7x despite higher amplification, whereas the fully contracting GPT-2 Small degrades 120x. Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.

Paper Structure (66 sections, 4 theorems, 8 equations, 9 figures, 13 tables)

Introduction
Contributions.
The Living Inference Framework
Stage 1: Semantic Caching
Stage 2: Sparse Activation
Stage 3: Low-Rank Approximation
Composition Theorem
Formal Verification in Lean 4
Proof Architecture
Proof Strategy and Multi-Layer Bounds
Scope of formal guarantees.
Experimental Setup
Model.
Data.
Implementation.
...and 51 more sections

Key Result

Theorem 1

For any bounded linear map $f$ and inputs $x, x_c$ with $\|x - x_c\| \leq \delta$:

Figures (9)

Figure 1: Overview. (a) Three composable approximation stages, each with machine-checked error bounds. (b) Lyapunov $V(t)$ trajectories across five architectures (all measured via unified pipeline): GPT-2 Small contracts monotonically ($\rho_{\max} = 0.96$), while GPT-2 Medium's final layer amplifies dramatically ($\rho_{23} = 2.05$). LFM2-2.6B (hybrid) and Mistral-7B show moderate amplification; Qwen3-8B oscillates most. (c) Catastrophe threshold $\tau$ increases with model dimension but is modulated by architecture (Table \ref{tab:threshold_all}).
Figure 2: Structural sensitivity analysis. (a) Per-group compression regret on log scale ($12 \times 6$ heatmap). Each cell shows the perplexity increase when that group alone is compressed (sparsity=5%, rank=32). Layer 0 mlp_fc dominates at 459,904. (b) Component-type sensitivity hierarchy: MLP components (red) are consistently more sensitive than attention (blue), spanning a $46\times$ range.
Figure 3: The early-layer MLP catastrophe. (a) Layer 0 component decomposition across five compression levels: mlp_fc dominates at every level, reaching 460K regret at aggressive compression. V projections remain nearly free even at extreme settings. (b) mlp_fc regret by layer reveals a sharp phase transition between layers 2 and 3---a $260\times$ cliff separating catastrophic from negligible sensitivity.
Figure 4: Cumulative compression: forward (L0$\to$L11) vs. backward (L11$\to$L0). The $25{,}716\times$ gap at one layer compressed demonstrates that layer 0 dominates the sensitivity landscape.
Figure 5: Contraction decomposition. (a) GPT-2 Small: blue bars show hidden state growth ($\alpha^2$), red bars show error growth ($\rho \cdot \alpha^2$), green diamonds show contraction factor $\rho$. Layer 0's massive expansion ($\alpha^2 = 123$) produces the strongest contraction ($\rho = 0.25$). All layers contract. (b) Mistral-7B across 31 layers. Red-shaded regions highlight amplifying layers ($\rho > 1$).
...and 4 more figures
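
The contraction decomposition in the Figure 5 caption relates three quantities: hidden state growth ($\alpha^2$), error growth ($\rho \cdot \alpha^2$), and the contraction factor $\rho$. A minimal Python sketch of that relation follows. It assumes that each growth term is a ratio of squared Euclidean norms across one layer; the caption does not give that definition, and the function and argument names are hypothetical.

```python
def sq_norm(v):
    """Squared Euclidean norm of a vector given as a sequence of floats."""
    return sum(x * x for x in v)


def contraction_decomposition(h_in, h_out, e_in, e_out):
    """Split one layer's error growth into hidden state growth and contraction.

    Assumed definitions (not taken from the paper):
      alpha_sq     = ||h_out||^2 / ||h_in||^2   (hidden state growth)
      error_growth = ||e_out||^2 / ||e_in||^2   (equals rho * alpha_sq)
      rho          = error_growth / alpha_sq    (rho < 1 means the error
                                                 shrinks relative to the state)
    """
    alpha_sq = sq_norm(h_out) / sq_norm(h_in)
    error_growth = sq_norm(e_out) / sq_norm(e_in)
    rho = error_growth / alpha_sq
    return alpha_sq, error_growth, rho


# Toy vectors (illustrative only, not measured data): the hidden state grows
# 25x in squared norm while the error grows only 9x, so the layer contracts.
alpha_sq, error_growth, rho = contraction_decomposition(
    h_in=[1.0, 0.0], h_out=[3.0, 4.0],
    e_in=[0.5, 0.0], e_out=[1.5, 0.0],
)
is_contracting = rho < 1.0
```

Under this reading, the layer 0 values quoted in the caption ($\alpha^2 = 123$, $\rho = 0.25$) correspond to an error growth of $0.25 \times 123 = 30.75$: the error grows in absolute terms, but more slowly than the hidden state.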

Theorems & Definitions (4)

Theorem 1: Retrieval Bound
Theorem 2: Sparse Activation Bound
Theorem 3: Interpolation Bound
Theorem 4: Living Inference Composition Bound
