Towards Atoms of Large Language Models

Chenhui Hu; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao

TL;DR

Atom Theory is a principled framework for identifying the fundamental representational units of LLMs. It defines atoms through the atomic inner product (AIP), provides two quantitative criteria, faithfulness ($R^2$) and stability ($q^*$), and proves that threshold-activated sparse autoencoders (TSAEs) can identifiably recover the atom set. Empirically, the AIP corrects the representation shift observed under Euclidean metrics; neurons and features are not ideal atoms; and atoms with near-perfect faithfulness and stability emerge across multiple models and exhibit strong monosemanticity. These results establish a rigorous basis for interpreting internal LLM representations and lay the groundwork for reliable analysis and control of deep models.

Abstract

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry, thereby grounding Atom Theory. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). Leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, statistically satisfying the criteria for ideal atoms. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code is available at https://github.com/ChenhuiHu/towards_atoms.
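
To make the two ingredients named above more concrete, here is a minimal NumPy sketch of a threshold-activated sparse autoencoder forward pass and a reconstruction $R^2$. It is illustrative only and is not taken from the authors' code: the activation rule (a unit passes its pre-activation through only when it exceeds a threshold `theta`), the variance-explained form of $R^2$, and all names (`tsae_forward`, `r_squared`, `W_enc`, `W_dec`, `b_enc`, `theta`) are assumptions. The paper's exact definitions, including how the AIP enters and how stability ($q^*$) is measured, are not given in this summary and are not reproduced here.

```python
import numpy as np

def tsae_forward(x, W_enc, b_enc, W_dec, theta):
    """Illustrative threshold-activated SAE forward pass (assumed form).

    x: (n, d) representations; W_enc: (d, m); b_enc: (m,); W_dec: (m, d).
    theta: scalar or (m,) activation threshold (hypothetical parameter).
    """
    pre = x @ W_enc + b_enc                   # (n, m) pre-activations
    codes = np.where(pre > theta, pre, 0.0)   # zero out units at or below threshold
    x_hat = codes @ W_dec                     # (n, d) reconstruction
    return codes, x_hat

def r_squared(x, x_hat):
    """Fraction of variance in x explained by x_hat (assumed R^2 form)."""
    ss_res = np.sum((x - x_hat) ** 2)             # residual sum of squares
    ss_tot = np.sum((x - x.mean(axis=0)) ** 2)    # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

# Toy usage: an identity dictionary reconstructs sparse inputs whose
# nonzero entries all exceed the threshold.
x = np.array([[2.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 4.0, 0.0]])
eye = np.eye(4)
codes, x_hat = tsae_forward(x, eye, np.zeros(4), eye, theta=0.5)
print(r_squared(x, x_hat))
```

In this toy case the reconstruction is exact, so the score sits at the top of the scale; a reconstruction that only predicts the per-dimension mean would score zero under the same formula.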
