Stochastic Thermodynamics for Autoregressive Generative Models: A Non-Markovian Perspective

Takahiro Sagawa

Abstract

Autoregressive generative models -- including Transformers, recurrent neural networks, classical Kalman filters, state space models, and Mamba -- all generate sequences by sampling each output from a distribution conditioned on a deterministic summary of the past, producing genuinely non-Markovian observed processes. We develop a general theoretical framework based on stochastic thermodynamics for this class of architectures and introduce the entropy production, which can be efficiently estimated from sampled trajectories without exponential sampling cost despite the non-Markovian nature of the observed dynamics. As a proof-of-concept experiment for a large language model (LLM), we evaluate the token-level and sentence-level entropy production for a pre-trained Transformer-based model, GPT-2. We also demonstrate the framework in the linear Gaussian case, where the model reduces to the Kalman innovation representation and the entropy production admits an analytical expression. We further show that the entropy production decomposes exactly into non-negative per-step contributions expressed through retrospective inference, each of which further splits into two information-theoretically meaningful terms: a compression loss and a model mismatch. Our results establish a bridge between stochastic thermodynamics and modern generative models, and provide a starting point for quantifying irreversibility in a broad class of highly non-Markovian processes such as LLMs.
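To make the token-level quantity concrete, the following minimal sketch (Python, using the Hugging Face transformers library) scores a GPT-2 token sequence and its time reversal under the same pre-trained model and reports the per-token log-likelihood difference. This is an illustration only: it assumes that the token-level reversal is the gap between the forward log-likelihood and the log-likelihood of the reversed token sequence, and it omits the treatment of the initial token; the precise definitions are given by the equations in the main text.

```python
# Hedged sketch: per-token log-likelihood gap between a GPT-2 sequence and its
# time reversal. Assumes sigma_token ~ log P(y_1..y_T) - log P(y_T..y_1) under
# the same autoregressive model; the unconditional first-token term is omitted.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sequence_log_prob(token_ids: torch.Tensor) -> float:
    """Sum of log p(y_t | y_<t) for t = 2..T under GPT-2."""
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0)).logits[0]      # (T, vocab)
    log_probs = torch.log_softmax(logits[:-1], dim=-1)        # predictions for y_2..y_T
    targets = token_ids[1:].unsqueeze(1)
    return log_probs.gather(1, targets).sum().item()

def per_token_entropy_production(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    forward = sequence_log_prob(ids)
    backward = sequence_log_prob(ids.flip(0))                 # reversed token order
    return (forward - backward) / len(ids)

print(per_token_entropy_production("The cat sat on the mat because it was tired."))
```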

Paper Structure

This paper contains 53 sections, 106 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Schematic of the general causal structure of our setup for the general (non-recursive) case. (a) the forward process \eqref{eq:P-forward}, and (b) the backward process \eqref{eq:P-backward}. The blue arrows indicate deterministic functions, while the green arrows indicate stochastic influences. This figure illustrates the particular realization $\tilde{y}_s = y_{T-s+1}$; even in this case, $\tilde{h}_s \neq h_{T-s+1}$ in general.
  • Figure 2: Schematic of the causal structure for the recursive case. (a) the forward process \eqref{eq:P-forward} with \eqref{r_h_f}, and (b) the backward process \eqref{eq:P-backward} with \eqref{re_backward}. The blue arrows indicate deterministic functions, while the green arrows indicate stochastic influences. By following the arrows, one can see that this recursive diagram is a special case of the general diagram (Figure \ref{fig:causal}). This figure illustrates the particular realization $\tilde{y}_s = y_{T-s+1}$; even in this case, $\tilde{h}_s \neq h_{T-s+1}$ in general.
  • Figure 3: Distribution of the per-token stochastic entropy production for sequences of $T=120$ tokens sampled from GPT-2 (no top-$k$ or nucleus truncation). The temperature parameter is $\tau = 1$. (a) Token-level reversal $\sigma_{\mathrm{token}}/T$, computed on full generated sequences (filled purple); the dashed orange step histogram shows the reference $\sigma_{\mathrm{token}}(T')/T'$, i.e. token-level reversal applied to the truncated sequences of length $T'$. (b) Block-level reversal $\sigma_{\mathrm{block}}/T'$, where each sequence is post-hoc truncated at the last sentence-final punctuation token (length $T' \le T$). Dashed red lines indicate the sample means; the dotted orange line in (a) indicates the mean of the reference distribution. We collect samples until $N = 500$ of them satisfy the bijection condition for block reversal (b). Samples that fail this condition are excluded from the block-level analysis but retained for the token-level one, so the token-reversal count in (a) is slightly larger (namely 516). Note the different horizontal scales between (a) and (b).
  • Figure 4: Per-token stochastic entropy production evaluated on GPT-2 for 30 causal texts (red) and 30 non-causal texts (blue) generated by a separate language model (Claude Opus 4.6). (a) Token-level reversal $\sigma_{\mathrm{token}}/T$; (b) block (sentence)-level reversal $\sigma_{\mathrm{block}}/T$. Individual data points are shown as a strip plot. In each panel, the box spans the interquartile range (25th to 75th percentiles) of the 30 samples, the horizontal line inside the box marks the median, and the whiskers extend to the most extreme data point within $1.5$ times the interquartile range from the box edges; diamonds indicate the sample mean. The temperature parameter of GPT-2 is $\tau = 1$. Note the different vertical scales between panels (a) and (b).
  • Figure 5: Numerical verification of the analytical entropy production \eqref{eq:new-kl-R} by Monte Carlo sampling \eqref{eq:sigma-per-traj}--\eqref{eq:sigma-MC} with $N = 20{,}000$ trajectories. Solid curves: analytical values; circles with error bars: Monte Carlo estimates. Error bars indicate $\pm 1$ standard error of the mean, $\mathrm{SE} = \mathrm{std}(\sigma) / \sqrt{N}$. (a) Scalar case ($n_x = n_y = 1$) with $A = 0.9$, $C = 1$, $Q = 1$, $R = 1$, and (b) multivariate case ($n_x = n_y = 2$) with $A = \begin{pmatrix} 0.8 & 0.3 \\ 0 & 0.5 \end{pmatrix}$, $C = \begin{pmatrix} 1 & 0.5 \\ 0 & 1 \end{pmatrix}$, $Q = I_2$, $R = I_2$. (A minimal Monte Carlo sketch in this spirit is given after the figure list.)
  • ...and 5 more figures
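The linear Gaussian case of Figure 5 can likewise be approached numerically. The sketch below, for the scalar parameters of panel (a), is a hedged illustration under one plausible reading of the setup: it treats the per-trajectory entropy production as the log-likelihood gap between a simulated observation trajectory and its time reversal, both evaluated with the steady-state Kalman innovation representation, and reports the Monte Carlo mean with the standard error used in the figure; the paper's \eqref{eq:sigma-per-traj}--\eqref{eq:sigma-MC} define the quantities precisely.

```python
# Hedged Monte Carlo sketch for the scalar linear Gaussian case of Fig. 5(a).
# Assumes sigma per trajectory = log-likelihood of y_{1:T} minus that of its
# time reversal, both under the steady-state Kalman innovation representation.
import numpy as np

rng = np.random.default_rng(0)
A, C, Q, R = 0.9, 1.0, 1.0, 1.0        # parameters from Fig. 5(a)
T, N = 100, 20_000                      # trajectory length, sample count (reduce N for a quick check)

# Steady-state prior variance P from the scalar Riccati recursion.
P = 1.0
for _ in range(1000):
    S = C * P * C + R                   # innovation variance
    P = A * (P - P * C / S * C * P) * A + Q

def innovation_loglik(y: np.ndarray) -> float:
    """Log-likelihood of y under the steady-state Kalman innovation form."""
    S = C * P * C + R
    K = P * C / S                       # steady-state Kalman gain
    xhat, ll = 0.0, 0.0
    for yt in y:
        e = yt - C * xhat               # innovation
        ll += -0.5 * (np.log(2.0 * np.pi * S) + e * e / S)
        xhat = A * (xhat + K * e)
    return ll

sigmas = np.empty(N)
for i in range(N):
    x = rng.standard_normal() * np.sqrt(Q / (1.0 - A * A))   # stationary initial state
    y = np.empty(T)
    for t in range(T):
        x = A * x + np.sqrt(Q) * rng.standard_normal()
        y[t] = C * x + np.sqrt(R) * rng.standard_normal()
    sigmas[i] = innovation_loglik(y) - innovation_loglik(y[::-1])

print(f"sigma (MC) = {sigmas.mean():.3f} +/- {sigmas.std() / np.sqrt(N):.3f}")
```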