Multi-Scale Probabilistic Generation Theory: A Unified Information-Theoretic Framework for Hierarchical Structure in Large Language Models

Yukin Zhang, Qi Dong

TL;DR

MSPGT presents a unified information-theoretic framework that treats large language models as Hierarchical Variational Information Bottleneck systems with Global, Intermediate, and Local scales. By introducing architecture-conditioned compression weights and a rate–distortion objective, the theory makes falsifiable predictions about scale boundaries and their sensitivity to architecture and training. Through boundary-detection, scale-specific interventions, and cross-architecture robustness experiments on Llama and Qwen models, the work shows consistent multi-scale organization with architecture-dependent variations, notably stable Local-Global dynamics and variable Intermediate influence. This approach bridges descriptive interpretability with predictive theory, offering a principled path toward understanding how hierarchical structure emerges in LLMs and how it shifts across architectures. The findings have implications for model design, safety, and interpretability, providing concrete metrics and methods for probing internal information dynamics.

Abstract

Large Language Models (LLMs) exhibit remarkable emergent abilities but remain poorly understood at a mechanistic level. This paper introduces the Multi-Scale Probabilistic Generation Theory (MSPGT), a theoretical framework that models LLMs as Hierarchical Variational Information Bottleneck (H-VIB) systems. MSPGT posits that standard language modeling objectives implicitly optimize multi-scale information compression, leading to the spontaneous formation of three internal processing scales: Global, Intermediate, and Local. We formalize this principle, derive falsifiable predictions about boundary positions and architectural dependencies, and validate them through cross-model experiments combining multi-signal fusion and causal interventions. Results across Llama and Qwen families reveal consistent multi-scale organization but strong architecture-specific variations, partially supporting and refining the theory. MSPGT thus advances interpretability from descriptive observation toward predictive, information-theoretic understanding of how hierarchical structure emerges within large neural language models.
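
For orientation, the first line below is the standard single-level variational information bottleneck objective: a prediction (distortion) term plus a KL compression (rate) term weighted by β. The second line is only a sketch of how a three-scale version could look. The per-scale latents z_s, the weights β_s, and the additive form are assumptions made here for illustration; they are not taken from the paper's equations.

```latex
% Standard VIB objective (single bottleneck z)
\mathcal{L}_{\mathrm{VIB}}
  = \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p_\theta(z \mid x)}\bigl[-\log q_\phi(y \mid z)\bigr]
  + \beta\,\mathrm{KL}\bigl(p_\theta(z \mid x)\,\|\,r(z)\bigr)

% Hypothetical three-scale extension (assumed form, s ranges over Global, Intermediate, Local)
\mathcal{L}_{\mathrm{H\text{-}VIB}}
  = \mathbb{E}\bigl[-\log q_\phi(y \mid z_G, z_I, z_L)\bigr]
  + \sum_{s \in \{G, I, L\}} \beta_s\,\mathrm{KL}\bigl(p_\theta(z_s \mid x)\,\|\,r(z_s)\bigr)
```

Under this reading, a larger β_s means stronger compression at scale s, which is one way to interpret the "architecture-conditioned compression weights" mentioned in the TL;DR.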

Paper Structure

This paper contains 56 sections, 32 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Theory–Experiment loop overview.
  • Figure 2: Comparison of H-VIB Information Flow across architectures. The Llama and Qwen models show distinct information propagation dynamics across layers and semantic scales. Color intensity indicates information magnitude, with transitions highlighted at L–I (Local–Intermediate) and I–G (Intermediate–Global) boundaries.
  • Figure 3: Boundary detection signals for Qwen2.5-7B. The combined evidence curve (black) shows two prominent peaks at L2 (L–I boundary) and L20 (I–G boundary).
  • Figure 4: Boundary detection signals for Qwen1.5-7B. Detected boundaries: L2 (L–I) and L8 (I–G).
  • Figure 5: Boundary detection signals for Llama-3-8B. Detected boundaries: L13 (L–I) and L16 (I–G).
  • ...and 5 more figures
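
The boundary-detection captions above describe a combined evidence curve whose two most prominent peaks are read as the L–I and I–G boundaries. The sketch below shows one simple way such a fusion could work: z-score each per-layer signal, average them, and take the two highest local maxima. The function names, the z-score averaging rule, and the peak criterion are all assumptions for illustration; the paper's actual signals and fusion method are not given in this summary.

```python
import numpy as np


def fuse_signals(signals):
    """Z-score each per-layer signal and average them into one evidence curve.

    `signals` has shape (n_signals, n_layers). This fusion rule is an
    assumption for illustration, not the paper's documented method.
    """
    signals = np.asarray(signals, dtype=float)
    mean = signals.mean(axis=1, keepdims=True)
    std = signals.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant signals at zero instead of dividing by 0
    return ((signals - mean) / std).mean(axis=0)


def top_two_peaks(evidence):
    """Return the layer indices of the two highest strict local maxima, in layer order."""
    e = np.asarray(evidence, dtype=float)
    interior = np.arange(1, len(e) - 1)  # endpoints cannot be local maxima here
    is_peak = (e[interior] > e[interior - 1]) & (e[interior] > e[interior + 1])
    peaks = interior[is_peak]
    if len(peaks) < 2:
        raise ValueError("need at least two local maxima to place two boundaries")
    best = peaks[np.argsort(e[peaks])[-2:]]
    return tuple(sorted(int(i) for i in best))


def make_synthetic_signals(n_layers=28):
    """Build two synthetic signals with bumps at layers 2 and 20 (made-up data)."""
    a = np.zeros(n_layers)
    a[2], a[20] = 3.0, 2.0
    b = np.zeros(n_layers)
    b[2], b[10], b[20] = 1.0, 0.5, 4.0  # layer 10 is a minor distractor bump
    return np.stack([a, b])


if __name__ == "__main__":
    evidence = fuse_signals(make_synthetic_signals())
    print(top_two_peaks(evidence))
```

The lower index is read as the L–I boundary and the higher as the I–G boundary. Averaging z-scores keeps any single signal from dominating because of its scale, at the cost of treating all signals as equally reliable.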