The Information of Large Language Model Geometry
Zhiquan Tan, Chenghai Li, Weiran Huang
TL;DR
The paper tackles how information is encoded in LLM embeddings and why model scaling follows a power law. It combines empirical entropy analyses with an information-theoretic framework, deriving a conditional-entropy-based explanation for scaling laws and linking the information gain of new tokens to ridge regression in autoregressive generation. It further shows that token information is distributed across the entire context rather than concentrated in a few tokens, that Lasso-based token selection sometimes outperforms attention weights, and that sentence-level semantic structure can be captured via mean embeddings and covariance-based distances. Overall, the work provides a principled, information-theoretic view of LLM geometry, scaling behavior, and context integration, with potential implications for data curation and model design.
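To make the notion of "entropy of a set of embeddings" concrete, here is a minimal sketch of one log-determinant (coding-rate style) proxy computed from the covariance of token embeddings. The function name `logdet_entropy`, the distortion parameter `eps`, and the synthetic inputs are assumptions made for illustration; the estimator used in the paper may differ in form and normalization.

```python
import numpy as np

def logdet_entropy(embeddings: np.ndarray, eps: float = 0.1) -> float:
    """Log-determinant entropy proxy for an (n_tokens, dim) embedding matrix.

    Assumed form (illustrative): 0.5 * log det(I + (dim / eps^2) * Cov).
    """
    n, d = embeddings.shape
    # Center the embeddings so the proxy depends on spread, not on the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / n
    # slogdet avoids overflow/underflow that log(det(...)) can hit.
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / eps**2) * cov)
    return 0.5 * float(logdet)

# Synthetic stand-ins for real model embeddings (hypothetical data).
rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 16))                       # variance in every direction
collapsed = np.outer(rng.normal(size=256), np.ones(16))   # all variance in one direction

print("spread:   ", logdet_entropy(spread))
print("collapsed:", logdet_entropy(collapsed))
```

Under this proxy, embeddings that occupy many directions score higher than embeddings collapsed onto a single direction, which is the qualitative behavior an entropy measure over representations is expected to have.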
Abstract
This paper investigates the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power-law relationship with model size. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens and find that it sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.
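The regression view described above can be illustrated with a small sketch: regress the last token's embedding on the context token embeddings with ridge regression, and use an L1-penalized (Lasso) fit to pick out a sparse subset of context tokens. Everything here is an assumption for illustration: the helper names `ridge_fit` and `lasso_ista`, the penalty values, the use of the ridge residual as a rough stand-in for "new information", and the synthetic data. The Lasso is solved with a plain ISTA loop to keep the example dependency-free; it is not presented as the paper's procedure.

```python
import numpy as np

def ridge_fit(context: np.ndarray, target: np.ndarray, lam: float = 1.0):
    """Ridge regression of `target` (dim,) on the columns of `context` (dim, n_ctx).

    Returns the weights and the squared residual norm. A large residual means the
    context explains little of the target embedding (illustrative proxy only).
    """
    gram = context.T @ context + lam * np.eye(context.shape[1])
    weights = np.linalg.solve(gram, context.T @ target)
    residual = target - context @ weights
    return weights, float(residual @ residual)

def lasso_ista(context: np.ndarray, target: np.ndarray,
               lam: float = 0.5, n_iter: int = 500) -> np.ndarray:
    """Minimize 0.5 * ||context @ w - target||^2 + lam * ||w||_1 with ISTA."""
    # Step size 1/L, where L is the squared spectral norm of `context`.
    step = 1.0 / np.linalg.norm(context, 2) ** 2
    w = np.zeros(context.shape[1])
    for _ in range(n_iter):
        grad = context.T @ (context @ w - target)
        z = w - step * grad
        # Soft-thresholding sets small coefficients exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

# Hypothetical data: 10 context tokens in 64 dimensions; the "last token"
# is built from context tokens 2 and 7 plus a little noise.
rng = np.random.default_rng(0)
context = rng.normal(size=(64, 10))
true_w = np.zeros(10)
true_w[[2, 7]] = [1.0, -0.8]
last_token = context @ true_w + 0.01 * rng.normal(size=64)

_, res = ridge_fit(context, last_token, lam=1.0)
selected = np.flatnonzero(lasso_ista(context, last_token))
print("ridge squared residual:", res)
print("context tokens selected by Lasso:", selected)
```

On real embeddings the selected indices could be compared against the positions with the largest attention weights. In this toy setup the target is constructed from two context tokens, so a sparse solution exists by design, whereas the abstract reports that in actual models information is spread across many tokens.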
