Towards Atoms of Large Language Models

Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR

Atom Theory is a principled framework for identifying the fundamental representational units of LLMs, which it defines as atoms via the atomic inner product (AIP). It provides two quantitative criteria, faithfulness ($R^2$) and stability ($q^*$), and proves that threshold-activated sparse autoencoders (TSAEs) can identifiably recover the atom set. Empirically, the paper reports three findings: the AIP corrects the representation shift observed under the Euclidean inner product; neurons and features are not ideal atoms; and atoms with near-perfect faithfulness and stability emerge across multiple models and exhibit strong monosemanticity. These results give a rigorous basis for interpreting internal LLM representations and lay the groundwork for reliable analysis and control of deep models.

Abstract

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code available at https://github.com/ChenhuiHu/towards_atoms.
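To make the roles of encoder and decoder concrete, here is a minimal sketch of one forward pass through a threshold-activated sparse autoencoder. It assumes a JumpReLU-style gate (a pre-activation passes through unchanged only if it exceeds a per-unit threshold), which is one plausible reading of "threshold-activated"; the abstract does not specify the exact activation. The names `tsae_forward`, `W_enc`, `b_enc`, `theta`, and `D` are hypothetical and are not taken from the paper's code.

```python
def tsae_forward(x, W_enc, b_enc, theta, D):
    """One forward pass of a toy threshold-activated sparse autoencoder.

    x      : input representation, list of H floats
    W_enc  : encoder weights, one row of H floats per atom
    b_enc  : encoder biases, one float per atom
    theta  : activation thresholds, one float per atom (assumed gate)
    D      : decoder matrix of shape H x |D|, whose columns are the atoms
    """
    # Encoder ("atom detector"): one linear pre-activation per candidate atom.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    # Assumed threshold gate: keep a pre-activation only if it exceeds
    # its threshold, otherwise set the code to zero.
    codes = [p if p > t else 0.0 for p, t in zip(pre, theta)]
    # Decoder ("atom set"): reconstruct x as a sparse combination of atoms.
    recon = [sum(D[h][k] * codes[k] for k in range(len(codes)))
             for h in range(len(x))]
    return codes, recon


# Toy setup with two atoms aligned to the coordinate axes (hand-picked values).
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
theta = [0.5, 0.5]
D = [[1.0, 0.0], [0.0, 1.0]]

codes, recon = tsae_forward([0.3, 0.9], W_enc, b_enc, theta, D)
```

In this toy setup the first coordinate (0.3) falls below its threshold and is dropped, so only the second atom is active in `codes`. That is the sparsity mechanism the gate is meant to illustrate.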

Paper Structure

This paper contains 45 sections, 12 theorems, 51 equations, 43 figures, and 11 tables.

Key Result

Theorem 3.2

Let $\langle \bm{d}_i, \bm{d}_j \rangle_S = \bm{d}_i^\top \bm S \bm{d}_j$ be an atomic inner product with $\bm S\in \mathbb{R}^{H\times H}$ symmetric and positive definite. If the columns of $\bm D = [\bm{d}_1, \bm{d}_2, \cdots, \bm{d}_{|D|}]$ form the atom set such that $\forall i,\ \|\bm{d}_i\|_S
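As a small illustration of the inner product $\langle \bm{u}, \bm{v} \rangle_S = \bm{u}^\top \bm S \bm{v}$ that the theorem starts from, the sketch below evaluates it in plain Python, together with the norm and angle it induces. The matrix `S_toy` and the vectors `u` and `v` are hypothetical values chosen by hand; they are not taken from the paper, and the sketch does not show how the paper obtains $\bm S$.

```python
import math


def aip(u, v, S):
    """Inner product <u, v>_S = u^T S v for a symmetric positive definite S."""
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def aip_norm(u, S):
    """Norm induced by the S-inner product."""
    return math.sqrt(aip(u, u, S))


def aip_angle_deg(u, v, S):
    """Angle in degrees between u and v under the S-inner product."""
    cos = aip(u, v, S) / (aip_norm(u, S) * aip_norm(v, S))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


S_identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the Euclidean inner product
S_toy = [[1.0, -1.0], [-1.0, 2.0]]      # symmetric, trace 3, determinant 1

u = [1.0, 0.0]
v = [1.0, 1.0]

euclidean_angle = aip_angle_deg(u, v, S_identity)
atomic_angle = aip_angle_deg(u, v, S_toy)
```

With these hand-picked values, `u` and `v` are $45^\circ$ apart under the Euclidean inner product but orthogonal under `S_toy`. This shows how changing the metric can move a pairwise angle to $90^\circ$ without changing the vectors themselves.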

Figures (43)

  • Figure 1: Illustration of Atom Theory. (a) Atoms are defined based on the atomic inner product, inducing representability, sparsity, and separability. (b) Atoms are evaluated by faithfulness ($R^2$) and stability ($q^*$), measuring fidelity and stable-atom fraction. (c) Threshold-activated SAEs enable atom identification, with the encoder as an atom detector and the decoder as the target atom set.
  • Figure 2: Representation shift at the final layer across multiple LLMs under the Euclidean inner product, with the centroid of pairwise representation angles deviating from $90^\circ$. See the Representation Shift appendix for full results.
  • Figure 3: Correction of representation shift at the final layer across multiple LLMs via the atomic inner product, with the centroid of pairwise representation angles consistently approaching $90^\circ$. See the Representation Shift appendix for full results.
  • Figure 4: Comparison of neurons, features, and ideal atoms across all layers of different LLMs. Ideal atoms are required to exhibit both high faithfulness and high stability, corresponding to $R^2 = 1$ and $q^* = 1$, respectively. Values of $R^2$ below 0 are clipped to 0.
  • Figure 5: Matching TSAE capacity and data scale on Gemma2-2B (measured by $R^2$). Data $\times$ and TSAE $\times$ denote data scale and model capacity (interval 9,216). Red dashed lines mark the capacity range enabling reliable atom identification.
  • ...and 38 more figures
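The caption of Figure 4 mentions that values of $R^2$ below 0 are clipped to 0. The sketch below shows how that can happen. It assumes that faithfulness is the standard coefficient of determination between original representations and their reconstructions, computed with Euclidean squared error; this listing does not spell out the paper's exact definition. The function name `r_squared` and the toy data are hypothetical.

```python
def r_squared(targets, reconstructions, clip=True):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot.

    targets, reconstructions : lists of equal-length float vectors.
    clip : if True, negative values are reported as 0, as in Figure 4.
    """
    n = len(targets)
    dim = len(targets[0])
    # Per-dimension mean of the targets, used as the baseline predictor.
    mean = [sum(row[k] for row in targets) / n for k in range(dim)]
    ss_res = sum((t - r) ** 2
                 for t_row, r_row in zip(targets, reconstructions)
                 for t, r in zip(t_row, r_row))
    ss_tot = sum((t - m) ** 2
                 for t_row in targets
                 for t, m in zip(t_row, mean))
    if ss_tot == 0.0:
        raise ValueError("targets have zero variance; R^2 is undefined")
    r2 = 1.0 - ss_res / ss_tot
    return max(0.0, r2) if clip else r2


targets = [[0.0, 0.0], [2.0, 2.0]]
perfect = [[0.0, 0.0], [2.0, 2.0]]   # exact reconstruction
swapped = [[2.0, 2.0], [0.0, 0.0]]   # worse than predicting the mean

r2_perfect = r_squared(targets, perfect)
r2_swapped_raw = r_squared(targets, swapped, clip=False)
r2_swapped = r_squared(targets, swapped)
```

A reconstruction that is worse than predicting the mean yields a negative raw $R^2$, which the clipped variant reports as 0.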

Theorems & Definitions (29)

  • Definition 3.1: Atomic Inner Product; AIP
  • Theorem 3.2: Explicit Form of the Atomic Inner Product
  • Corollary 3.2: Normalized Atomic Inner Product; NAIP
  • Remark
  • Definition 3.3: Sparsity Level
  • Remark
  • Definition 3.4: $\epsilon$-Approximately Orthogonal Atoms
  • Remark
  • Definition 3.5: Atoms
  • Remark
  • ...and 19 more