The Information of Large Language Model Geometry

Zhiquan Tan, Chenghai Li, Weiran Huang

TL;DR

The paper asks how information is encoded in LLM embeddings and why model scaling follows a power law. It combines empirical entropy analyses with an information-theoretic framework, deriving a conditional-entropy-based explanation for scaling laws and linking the information gain of new tokens to ridge regression in autoregressive generation. It further finds that information is distributed across the entire context rather than concentrated in a few tokens, that Lasso-based token selection sometimes outperforms attention weights, and that mean embeddings and covariance-based distances reveal sentence-level semantic structure. Overall, the work offers an information-theoretic view of LLM geometry, scaling behavior, and context integration, with potential implications for data curation and model design.

Abstract

This paper investigates the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power law relationship with model size. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.
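
To make the regression view in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It regresses a stand-in "last token" embedding on a matrix of stand-in "context token" embeddings, once with closed-form ridge regression and once with Lasso solved by proximal gradient descent (ISTA). The data are synthetic, and the function names `ridge_weights` and `lasso_weights`, the penalty values, and the dimensions are all assumptions chosen for illustration.

```python
import numpy as np


def ridge_weights(context, target, lam=1.0):
    """Closed-form ridge regression of target (d,) on context columns (d, n)."""
    n = context.shape[1]
    gram = context.T @ context + lam * np.eye(n)
    return np.linalg.solve(gram, context.T @ target)


def lasso_weights(context, target, lam=1.0, n_iter=2000):
    """Lasso via ISTA: minimizes 0.5 * ||target - context @ w||^2 + lam * ||w||_1."""
    w = np.zeros(context.shape[1])
    # Step size 1/L, where L is the largest eigenvalue of context.T @ context.
    step = 1.0 / (np.linalg.norm(context, 2) ** 2)
    for _ in range(n_iter):
        grad = context.T @ (context @ w - target)
        z = w - step * grad
        # Soft-thresholding sets small coefficients exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w


# Synthetic stand-ins (assumption): 10 context "tokens" with 64-dim embeddings,
# and a last-token embedding that depends on only two of them.
rng = np.random.default_rng(0)
d, n = 64, 10
context = rng.normal(size=(d, n))
true_w = np.zeros(n)
true_w[[2, 7]] = [1.5, -2.0]
target = context @ true_w + 0.01 * rng.normal(size=d)

ridge_w = ridge_weights(context, target, lam=1.0)
lasso_w = lasso_weights(context, target, lam=5.0)

# Tokens that Lasso keeps (nonzero weight) versus the dense ridge solution.
selected = np.flatnonzero(np.abs(lasso_w) > 1e-6)
print("ridge weights:", np.round(ridge_w, 3))
print("lasso-selected token indices:", selected)
```

In this toy setting ridge assigns a weight to every context token, while the L1 penalty drives most Lasso weights to zero, which is what makes Lasso usable as a token selector. Comparing such weights against attention weights, as the paper does, would require real model embeddings and is not attempted here.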

Paper Structure (16 sections, 11 theorems, 37 equations, 6 figures)

Key Result

Lemma 3.6

Assume an LLM comprehends $n$ skills and $X \sim p_{\theta}$, $Y \sim p_{\text{skill}}$. Then the conditional entropy $\operatorname{H}(X | Y)$ obeys a power law relationship with $n$.
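
A power law relationship of this kind is usually checked by fitting a straight line in log-log space. The sketch below shows that generic procedure under stated assumptions: the helper `fit_power_law` is hypothetical, the data points are made up, and the form $h \approx c \cdot n^{b}$ is a generic ansatz whose constants and exponent sign are not taken from the paper.

```python
import numpy as np


def fit_power_law(n_values, h_values):
    """Fit h ~ c * n**b by least squares on (log n, log h); returns (c, b)."""
    log_n = np.log(np.asarray(n_values, dtype=float))
    log_h = np.log(np.asarray(h_values, dtype=float))
    slope, intercept = np.polyfit(log_n, log_h, 1)
    return float(np.exp(intercept)), float(slope)


# Made-up values standing in for (number of skills, conditional entropy) pairs.
n = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
h = 3.0 * n ** -0.5

c, b = fit_power_law(n, h)
print(f"fitted prefactor c = {c:.3f}, exponent b = {b:.3f}")
```

If measured points fall close to a line on log-log axes, the slope is the estimated exponent; strong curvature would suggest the power-law form does not hold over that range.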

Figures (6)

  • Figure 1: The relationship of (normalized) entropy and model size.
  • Figure 2: A schematic view of corpus and skills.
  • Figure 3: Quantities related to the information gain.
  • Figure 4: Visualization by UMAP and PCA.
  • Figure 5: Sentence distance using JS distance.
  • ...and 1 more figure

Theorems & Definitions (33)

  • Definition 2.1: Entropy
  • Definition 2.2: Conditional entropy
  • Definition 2.3: KL divergence
  • Definition 3.1
  • Definition 3.2: Instantaneous code
  • Definition 3.3: Expected code length
  • Definition 3.4: Comprehend a skill
  • Lemma 3.6
  • proof
  • Theorem 3.7: Scaling law with parameter
  • ...and 23 more