Sparse Semantic Dimension as a Generalization Certificate for LLMs

Dibyanayan Bandyopadhyay, Asif Ekbal

TL;DR

This work tackles the Generalization Paradox of large language models by proposing Sparse Semantic Dimension (SSD), a data-dependent complexity measure derived from sparse representations learned by a Sparse Autoencoder (SAE) over model activations. Treating the LLM and SAE as frozen oracles, it derives a high-probability generalization bound that scales with the active feature pool size $P$ rather than the parameter count, enabling non-vacuous certificates on real models like GPT-2 Small and Gemma-2B. Empirically, larger models exhibit sharper, more compressible semantic dictionaries, requiring fewer calibration samples to certify generalization, and the framework provides a practical safety monitor via a “feature explosion” signal under out-of-distribution inputs. The approach links interpretability and compression to formal guarantees, offering runtime indicators (per-input sparsity) for epistemic uncertainty and paving the way for layer-wise and dynamic extensions of SSD-based certifiable generalization.

Abstract

Standard statistical learning theory predicts that Large Language Models (LLMs) should overfit because their parameter counts vastly exceed the number of training tokens. Yet, in practice, they generalize robustly. We propose that the effective capacity controlling generalization lies in the geometry of the model's internal representations: while the parameter space is high-dimensional, the activation states lie on a low-dimensional, sparse manifold. To formalize this, we introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. Treating the LLM and SAE as frozen oracles, we utilize this framework to attribute the model's generalization capabilities to the sparsity of the dictionary rather than the total parameter count. Empirically, we validate this framework on GPT-2 Small and Gemma-2B, demonstrating that our bound provides non-vacuous certificates at realistic sample sizes. Crucially, we uncover a counter-intuitive "feature sharpness" scaling law: despite being an order of magnitude larger, Gemma-2B requires significantly fewer calibration samples to identify its active manifold compared to GPT-2, suggesting that larger models learn more compressible, distinct semantic structures. Finally, we show that this framework functions as a reliable safety monitor: out-of-distribution inputs trigger a measurable "feature explosion" (a sharp spike in active features), effectively signaling epistemic uncertainty through learned feature violation. Code is available at: https://github.com/newcodevelop/sparse-semantic-dimension.
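The quantities the abstract relies on (the active feature pool, per-input sparsity, and the "feature explosion" signal) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the tolerance `tol`, the 99th-percentile threshold, and the synthetic activation matrices are all assumptions made for illustration, standing in for real SAE codes of a frozen model.

```python
import numpy as np

def active_pool(codes, tol=0.0):
    """Indices of features that fire (> tol) on at least one calibration row."""
    return np.flatnonzero((codes > tol).any(axis=0))

def active_counts(codes, tol=0.0):
    """Per-input sparsity k: number of active features in each row."""
    return (codes > tol).sum(axis=1)

def out_of_pool_rate(codes, pool, tol=0.0):
    """Fraction of rows that activate at least one feature outside the pool."""
    outside = np.ones(codes.shape[1], dtype=bool)
    outside[pool] = False
    return float((codes[:, outside] > tol).any(axis=1).mean())

rng = np.random.default_rng(0)
n_features = 512  # hypothetical dictionary size

def synthetic_codes(n_rows, k, support):
    """Toy SAE codes: each row has exactly k positive entries among the first `support` features."""
    codes = np.zeros((n_rows, n_features))
    for row in codes:
        idx = rng.choice(support, size=k, replace=False)
        row[idx] = rng.random(k) + 0.1
    return codes

calib = synthetic_codes(200, 8, 64)            # calibration set
in_dist = synthetic_codes(50, 8, 64)           # same regime as calibration
ood = synthetic_codes(50, 96, n_features)      # many features, spread over the whole dictionary

pool = active_pool(calib)
threshold = np.quantile(active_counts(calib), 0.99)  # assumed alarm level
flag_in = active_counts(in_dist) > threshold
flag_ood = active_counts(ood) > threshold

print("pool size P:", pool.size)
print("flagged in-dist / OOD:", flag_in.mean(), flag_ood.mean())
print("out-of-pool rate in-dist / OOD:",
      out_of_pool_rate(in_dist, pool), out_of_pool_rate(ood, pool))
```

In this toy setup the pool size plays the role of $P$, and an input is flagged either because its active count exceeds the calibrated level or because it uses features outside the pool.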

Paper Structure (36 sections, 6 theorems, 36 equations, 4 figures, 3 tables)

Key Result

Lemma 1

For any two predictors $f, g$, [...]. In particular, [...].

Figures (4)

  • Figure 1: Sensitivity of Generalization Bounds to Concept Pool Calibration. We plot the theoretical generalization certificate (in bits) against evaluation sample size $N$ for GPT-2 Small (blue) and Gemma-2B (red). The left panel utilizes a minimal calibration set ($N_{cal}=1000$), while the right panel utilizes a moderate calibration set ($N_{cal}=6250$).
  • Figure 2: Decomposition of the Generalization Bound Components. We visualize the contribution of Risk ($R$), Gap ($\epsilon$), Mismatch ($\eta B$), and Complexity ($\Omega$) to the total bound. The plots compare In-Distribution (English) vs. Far-OOD (Random Noise) under the constraint Top-k=64. The bound certifies generalization for English via low risk and low pool mismatch. Conversely, it rejects Noise due to maximal empirical risk and a massive spike in pool mismatch (indicated in red).
  • Figure 3: The Complexity Shift. Histograms of active feature counts ($k$) for GPT-2 and Gemma-2B. GPT-2 shows a clear distributional shift to the right (higher complexity) for noise. Gemma-2B reveals a "heavy-tailed" failure mode on Far-OOD data (Max $k > 1500$) and a "compression" mode on Code (Shift Left), accurately reflecting its training distribution alignment.
  • Figure 4: Ablation: Semantic Specificity. Histograms of the per-sequence reconstruction gap ($\epsilon_{loss}$). Green (Real SAE): The error is tightly clustered near 0 bits, indicating high semantic fidelity. Red (Shuffled): Permuting the feature indices, while maintaining identical per-sample sparsity $k$, shifts the error distribution distinctly to the right, where it adopts a Gaussian-like profile (Mean shift: $\approx 6.5$ bits for GPT-2, $\approx 8.5$ bits for Gemma). This indicates that the generalization guarantee depends on the specific semantic alignment of the activated features, not just statistical sparsity.
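
The shuffled control in Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes a single global permutation of the feature axis, uses a random matrix as a hypothetical linear decoder, and measures plain reconstruction error rather than the loss gap in bits that the figure reports.

```python
import numpy as np

def shuffle_feature_indices(codes, rng):
    """Apply one random permutation to the feature axis.

    Each row keeps its activation values and its number of active
    features; only the dictionary atoms they are attached to change.
    """
    perm = rng.permutation(codes.shape[1])
    return codes[:, perm]

def reconstruct(codes, decoder):
    """Linear decoding of sparse codes back to activation space."""
    return codes @ decoder

rng = np.random.default_rng(0)
n_rows, n_features, d_model = 32, 256, 16  # hypothetical sizes

decoder = rng.normal(size=(n_features, d_model))  # stand-in for a trained SAE decoder
codes = np.zeros((n_rows, n_features))
for row in codes:
    row[rng.choice(n_features, size=8, replace=False)] = rng.random(8) + 0.1

target = reconstruct(codes, decoder)              # activations the real codes explain exactly
shuffled = shuffle_feature_indices(codes, rng)

err_real = np.linalg.norm(reconstruct(codes, decoder) - target, axis=1)
err_shuf = np.linalg.norm(reconstruct(shuffled, decoder) - target, axis=1)

print("sparsity preserved:",
      np.array_equal((codes > 0).sum(axis=1), (shuffled > 0).sum(axis=1)))
print("mean error real / shuffled:", err_real.mean(), err_shuf.mean())
```

The point of the control is that sparsity statistics are identical in both arms, so any increase in error is attributable to which features fire rather than how many.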

Theorems & Definitions (14)

  • Definition 1: Sparse Autoencoder Class
  • Definition 2: Reconstruction Inefficiency
  • Lemma 1: Decomposition of risk via loss gap
  • proof
  • Lemma 2: Pool mismatch bound
  • proof
  • Lemma 3: Uniform convergence for finite classes (Occam bound)
  • proof
  • Lemma 4: Counting pools
  • proof
  • ...and 4 more
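
The lemma titles above suggest a familiar combination: a uniform-convergence (Occam) bound over a finite class, with the class size controlled by counting feature pools. The sketch below uses the textbook forms, which may differ from the paper's exact statements. It assumes the finite class is indexed only by the choice of a pool of size $P$ from a dictionary of size $m$, so that $\ln|\mathcal{H}| = \ln\binom{m}{P} \le P\,(1 + \ln(m/P))$, and that the loss lies in $[0, B]$, giving a gap of $B\sqrt{(\ln|\mathcal{H}| + \ln(1/\delta))/(2N)}$ with probability at least $1-\delta$. All numeric values are illustrative and not taken from the paper.

```python
import math

def log_num_pools(dict_size, pool_size):
    """ln C(dict_size, pool_size), computed with lgamma to avoid overflow."""
    return (math.lgamma(dict_size + 1)
            - math.lgamma(pool_size + 1)
            - math.lgamma(dict_size - pool_size + 1))

def occam_gap(log_class_size, n_samples, delta, loss_bound):
    """Hoeffding plus union bound over a finite class, losses in [0, loss_bound]."""
    return loss_bound * math.sqrt(
        (log_class_size + math.log(1.0 / delta)) / (2.0 * n_samples)
    )

# Illustrative values only (assumptions, not the paper's settings).
dict_size, pool_size = 24576, 2000
delta, loss_bound = 0.05, 16.0

log_h = log_num_pools(dict_size, pool_size)
for n in (10_000, 100_000, 1_000_000):
    print(n, round(occam_gap(log_h, n, delta, loss_bound), 3))
```

Because the class size enters only through $\ln\binom{m}{P}$, the gap grows roughly like $\sqrt{P \ln(m/P) / N}$, which is how a bound of this shape depends on the active pool size rather than on the parameter count.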