Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy

Karthik Viswanathan, Sang Eon Park

TL;DR

The paper examines how large language models internalize higher-order statistics in next-token prediction by introducing a cumulant-expansion framework for the softmax entropy. It defines a center distribution $p(\boldsymbol{\mu})$ and derives cumulants $\kappa_n^{p_\beta(\mathbf{X})}$ to quantify higher-order correlations in logit space, then validates the approach with experiments on GPT-2 and Pythia using Pile-10K prompts. Key findings: structured prompts induce a depth-dependent rise and plateau in higher-order cumulants, shuffled prompts remain flat, cumulants grow monotonically during training, and math prompts exhibit cumulant signatures distinct from those of general text. Overall, cumulants offer a mathematically grounded, lightweight probe of feature-learning dynamics in high-dimensional neural networks, with potential implications for prompt design and diagnostic tooling. The central link between the observable entropy and higher-order structure is the expansion $\langle S(\mathbf{X})\rangle = S(\boldsymbol{\mu}) - \frac{1}{N} \sum_{n=2}^{\infty} \frac{\beta^n}{n!} \kappa^{p_\beta(\mathbf{X})}_n(-\sum_i\delta X_i)$.
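
Writing out the leading terms of the expansion quoted in the TL;DR shows how each cumulant order enters. This only unrolls the sum term by term; $\kappa_n$ is our shorthand for $\kappa^{p_\beta(\mathbf{X})}_n(-\sum_i\delta X_i)$, and $N$, $\beta$ and $\delta X_i$ keep whatever definitions the paper gives them (they are not restated on this page).

$$
\langle S(\mathbf{X})\rangle = S(\boldsymbol{\mu}) - \frac{1}{N}\left[\frac{\beta^2}{2}\,\kappa_2 + \frac{\beta^3}{6}\,\kappa_3 + \frac{\beta^4}{24}\,\kappa_4 + \cdots\right]
$$

The $n=2$ term is the variance-level contribution; the $n=3$ and $n=4$ terms carry the skew-level and kurtosis-level structure mentioned in the abstract.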

Abstract

We introduce a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By treating the softmax entropy of each layer's logit distribution as a perturbation around its "center" distribution, we derive closed-form cumulant observables that isolate successively higher-order correlations. Empirically, we track these cumulants in GPT-2 and Pythia models on Pile-10K prompts. (i) Structured prompts exhibit a characteristic rise-and-plateau profile across layers, whereas token-shuffled prompts remain flat, revealing the dependence of the cumulant profile on meaningful context. (ii) During training, all cumulants increase monotonically before saturating, directly visualizing the model's progression from capturing variance to learning skew, kurtosis, and higher-order statistical structures. (iii) Mathematical prompts show distinct cumulant signatures compared to general text, quantifying how models employ fundamentally different processing mechanisms for mathematical versus linguistic content. Together, these results establish cumulant analysis as a lightweight, mathematically grounded probe of feature-learning dynamics in high-dimensional neural networks.
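
The two quantities the abstract contrasts, the mean softmax entropy and the entropy of the "center", can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: it assumes the center $\boldsymbol{\mu}$ is the mean logit vector over token positions, and the helper names (`softmax_entropy`, `mean_and_center_entropy`, `sample_cumulants`) are hypothetical. The cumulant helper uses the standard moment formulas for a scalar sample and does not reproduce the paper's observable $\kappa_n^{p_\beta(\mathbf{X})}$.

```python
import numpy as np


def softmax_entropy(logits, beta=1.0):
    """Shannon entropy (in nats) of softmax(beta * logits) along the last axis."""
    z = beta * np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)


def mean_and_center_entropy(logits, beta=1.0):
    """Return (mean entropy over positions, entropy of the center).

    ASSUMPTION: `logits` has shape (positions, vocab) and the center is the
    mean logit vector over positions. The paper may define it differently.
    """
    X = np.asarray(logits, dtype=float)
    mu = X.mean(axis=0)
    return float(softmax_entropy(X, beta).mean()), float(softmax_entropy(mu, beta))


def sample_cumulants(x):
    """Cumulants kappa_2..kappa_4 of a 1-D sample from its central moments.

    Standard identities: kappa_2 = m2, kappa_3 = m3, kappa_4 = m4 - 3 * m2**2.
    These are plain (biased) sample estimates, not the paper's observable.
    """
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    return {2: m2, 3: m3, 4: m4 - 3.0 * m2**2}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=(16, 50))  # synthetic stand-in for one layer
    mean_S, center_S = mean_and_center_entropy(fake_logits)
    print("mean entropy:", mean_S, "center entropy:", center_S)
    print("cumulants of a Gaussian sample:", sample_cumulants(rng.normal(size=10_000)))
```

For a Gaussian sample the third and fourth cumulants should be close to zero, so non-zero higher-order values indicate structure beyond variance, which is the sense in which the abstract speaks of skew and kurtosis.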

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Schematic of Logit Geometry Across Layers. Each triangle represents the probability simplex at a given layer, where colored dots correspond to token logits mapped to probabilities. The red circle labeled ‘c’ indicates the probability of the center of logits. The histograms illustrate the distribution of token-wise deviations from the center.
  • Figure 2: Cumulants in Structured and Shuffled Prompts. Left: Cumulants across layers for a single structured prompt (prompt 3218 from Pile-10K) in GPT-2 Large. Middle: Cumulants for the shuffled version of the same prompt. Right: Mean softmax entropy (solid lines) compared with the entropy of the center (dotted lines) for both structured and shuffled prompts.
  • Figure 3: Evolution of Cumulants During Training. Left: Cumulants across layers of the Pythia-160M model tracked over training epochs. Right: Mean softmax entropy (solid line) and softmax entropy of the center (dotted line) as a function of training.
  • Figure 4: Cumulants in DM Mathematics and Pile-CC Prompts. Comparison of normalized cumulants ($\kappa_2$ through $\kappa_7$) and entropy measures across model layers for mathematical prompts (DM Mathematics topic with $99$ prompts) versus general web text (Pile-CC topic with $570$ prompts) in GPT-2 Large. Each plot shows the mean (solid line) and standard deviation (shaded region) computed across multiple prompts from each dataset. Mathematical prompts exhibit distinct cumulant profiles compared to general text.
  • Figure 5: Cumulants Across Prompts. Cumulants of structured and shuffled versions of four randomly selected prompts from the Pile-10K dataset [NeelNanda_pile-10k]. Each panel corresponds to a different prompt, and the numbers on the axis title represent the prompt number in the Pile-10K dataset. Structured prompts consistently show richer and more structured cumulant profiles compared to their shuffled counterparts.
  • ...and 2 more figures
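
The shuffled-prompt control used in Figures 2 and 5 can be sketched as a permutation of a prompt's token IDs. This is a minimal illustration under an assumption: the page does not say how the authors shuffle, so a uniform random permutation with a fixed seed is assumed here, and `shuffle_tokens` is a hypothetical helper name.

```python
import random


def shuffle_tokens(token_ids, seed=0):
    """Return a randomly permuted copy of `token_ids`.

    ASSUMPTION: a uniform permutation over all positions. The same tokens
    are kept, only their order (and hence the context) is destroyed.
    """
    rng = random.Random(seed)  # local RNG so the global state is untouched
    shuffled = list(token_ids)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    ids = [101, 7592, 2088, 2003, 2307, 102]  # made-up token IDs
    print(shuffle_tokens(ids, seed=1))
```

Because the shuffled prompt contains exactly the same tokens, any difference in the cumulant profiles can be attributed to token order rather than to vocabulary content.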