The Information of Large Language Model Geometry

Zhiquan Tan; Chenghai Li; Weiran Huang

TL;DR

The paper examines how information is encoded in LLM embeddings and why model scaling follows a power law. It combines empirical entropy analyses with an information-theoretic framework, deriving a conditional-entropy-based explanation for scaling laws and linking the information gain of new tokens to ridge regression in autoregressive generation. It further shows that token information is distributed across the entire context, with Lasso-based token selection sometimes outperforming attention weights, and demonstrates sentence-level semantic structure via mean embeddings and covariance-based distances. Overall, the work offers an information-theoretic view of LLM geometry, scaling behavior, and context integration, with potential implications for data curation and model design.

Abstract

This paper investigates the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power law relationship with model sizes. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments, and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.
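The abstract names ridge regression but does not spell out the regression setup. As a minimal sketch, assuming the last token's embedding is regressed on the embeddings of the preceding context tokens, the closed-form ridge solution could look like the following. The function name `ridge_fit`, the penalty `lam`, the column-per-token layout, and the synthetic data are all illustrative assumptions, not the paper's notation or results.

```python
import numpy as np

def ridge_fit(context, target, lam=1.0):
    """Closed-form ridge regression of a target embedding on context embeddings.

    context: (d, k) array, one column per previous-token embedding (assumed layout).
    target:  (d,) array, embedding of the last token.
    Returns the (k,) vector w minimizing
        ||target - context @ w||^2 + lam * ||w||^2.
    """
    k = context.shape[1]
    gram = context.T @ context + lam * np.eye(k)
    return np.linalg.solve(gram, context.T @ target)

# Synthetic stand-in for real token embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
d, k = 16, 5
context = rng.normal(size=(d, k))
target = context @ rng.normal(size=k) + 0.1 * rng.normal(size=d)

w = ridge_fit(context, target, lam=0.5)
residual = target - context @ w  # part of the target not explained by the context
```

The residual is the part of the last token's embedding that the context does not linearly account for; how the paper relates this quantity to information gain is stated in its own theorems, not here.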

Paper Structure (16 sections, 11 theorems, 37 equations, 6 figures)

Introduction
Background
Information-theoretic Quantities
Neural scaling law
Entropy in LLM (geometry)
Scaling law for dataset size
The information in the auto-regressive process
Information gain and ridge regression
Attention and Lasso
Does a token embedding contain all the information from its preceding context?
Related Work
Information theory.
Scaling law.
Conclusion
Ablation study
...and 1 more section

Key Result

Lemma 3.6

Assume an LLM comprehends $n$ skills and $X \sim p_{\theta}$, $Y \sim p_{\text{skill}}$. Then the conditional entropy $\operatorname{H}(X | Y)$ obeys a power law relationship with $n$.
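For reference, a power law in $n$ is a relation of the generic form below. The constants $C > 0$ and $\alpha$ are placeholders introduced here for illustration; the lemma statement above does not give their values or sign.

```latex
\operatorname{H}(X \mid Y) \approx C \, n^{\alpha}
\quad\Longleftrightarrow\quad
\log \operatorname{H}(X \mid Y) \approx \log C + \alpha \log n
```

Equivalently, the relationship appears as a straight line with slope $\alpha$ on a log-log plot.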

Figures (6)

Figure 1: The relationship of (normalized) entropy and model size.
Figure 2: A schematic view of corpus and skills.
Figure 3: Quantities related to the information gain.
Figure 4: Visualization by UMAP and PCA.
Figure 5: Sentence distance using JS distance.
...and 1 more figure

Theorems & Definitions (33)

Definition 2.1: Entropy
Definition 2.2: Conditional entropy
Definition 2.3: KL divergence
Definition 3.1
Definition 3.2: Instantaneous code
Definition 3.3: Expected code length
Definition 3.4: Comprehend a skill
Lemma 3.6
proof
Theorem 3.7: Scaling law with parameter
...and 23 more
