Multi-Scale Probabilistic Generation Theory: A Unified Information-Theoretic Framework for Hierarchical Structure in Large Language Models
Yukin Zhang, Qi Dong
TL;DR
MSPGT presents a unified information-theoretic framework that treats large language models as Hierarchical Variational Information Bottleneck systems with Global, Intermediate, and Local scales. By introducing architecture-conditioned compression weights and a rate–distortion objective, the theory makes falsifiable predictions about scale boundaries and their sensitivity to architecture and training. Through boundary-detection, scale-specific interventions, and cross-architecture robustness experiments on Llama and Qwen models, the work shows consistent multi-scale organization with architecture-dependent variations, notably stable Local-Global dynamics and variable Intermediate influence. This approach bridges descriptive interpretability with predictive theory, offering a principled path toward understanding how hierarchical structure emerges in LLMs and how it shifts across architectures. The findings have implications for model design, safety, and interpretability, providing concrete metrics and methods for probing internal information dynamics.
Abstract
Large Language Models (LLMs) exhibit remarkable emergent abilities but remain poorly understood at a mechanistic level. This paper introduces the Multi-Scale Probabilistic Generation Theory (MSPGT), a theoretical framework that models LLMs as Hierarchical Variational Information Bottleneck (H-VIB) systems. MSPGT posits that standard language modeling objectives implicitly optimize multi-scale information compression, leading to the spontaneous formation of three internal processing scales: Global, Intermediate, and Local. We formalize this principle, derive falsifiable predictions about boundary positions and architectural dependencies, and validate them through cross-model experiments combining multi-signal fusion and causal interventions. Results across Llama and Qwen families reveal consistent multi-scale organization but strong architecture-specific variations, partially supporting and refining the theory. MSPGT thus advances interpretability from descriptive observation toward predictive, information-theoretic understanding of how hierarchical structure emerges within large neural language models.
