The α-Law of Observable Belief Revision in Large Language Model Inference

Mike Farmer, Abhinav Kochar, Yugyung Lee

Abstract

Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.

Paper Structure

This paper contains 73 sections, 1 theorem, 11 equations, 13 figures, and 10 tables.

Key Result

Theorem 1

Let $q^*$ be a fixed point of the $\alpha$-law update for fixed evidence $b$. If $\alpha < 1$, then $q^*$ is asymptotically stable. If $\alpha \geq 1$, the fixed point is unstable under perturbation.
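
The stability claim can be illustrated numerically. The update equation itself is not reproduced on this page, so the sketch below assumes the multiplicative form suggested by the Figure 1 caption, $q_{t+1} \propto (q_t \cdot b)^{\alpha}$, renormalized over candidate answers. Under that assumption the interior fixed point is $q^* \propto b^{\alpha/(1-\alpha)}$. The function names and the evidence vector `b` are hypothetical, chosen only for illustration.

```python
import math

def alpha_update(q, b, alpha):
    """One revision step under the ASSUMED form q_next ∝ (q * b) ** alpha."""
    logits = [alpha * (math.log(qi) + math.log(bi)) for qi, bi in zip(q, b)]
    m = max(logits)  # subtract the max before exponentiating, for stability
    w = [math.exp(x - m) for x in logits]
    z = sum(w)
    return [wi / z for wi in w]

def fixed_point(b, alpha):
    """Interior fixed point q* ∝ b ** (alpha / (1 - alpha)); needs alpha != 1."""
    k = alpha / (1.0 - alpha)
    logits = [k * math.log(bi) for bi in b]
    m = max(logits)
    w = [math.exp(x - m) for x in logits]
    z = sum(w)
    return [wi / z for wi in w]

def iterate(q0, b, alpha, steps):
    """Apply the update `steps` times starting from q0."""
    q = list(q0)
    for _ in range(steps):
        q = alpha_update(q, b, alpha)
    return q

def log_gap(q, q_ref):
    """Largest absolute difference between log-probabilities."""
    return max(abs(math.log(a) - math.log(r)) for a, r in zip(q, q_ref))

b = [0.5, 0.3, 0.2]  # hypothetical fixed evidence, not taken from the paper

# alpha < 1: two different starting beliefs end up at the same fixed point.
q_stable = fixed_point(b, 0.8)
gap_a = log_gap(iterate([0.9, 0.05, 0.05], b, 0.8, 200), q_stable)
gap_b = log_gap(iterate([0.1, 0.1, 0.8], b, 0.8, 200), q_stable)

# alpha > 1: a small perturbation of the fixed point grows with each step.
q_unstable = fixed_point(b, 1.2)
bumped = [q_unstable[0] * 1.001, q_unstable[1], q_unstable[2]]
total = sum(bumped)
q_pert = [x / total for x in bumped]
gap_before = log_gap(q_pert, q_unstable)
gap_after = log_gap(iterate(q_pert, b, 1.2, 10), q_unstable)

print(gap_a, gap_b)           # both shrink toward zero
print(gap_before, gap_after)  # the gap widens
```

In log space the assumed update is linear up to the normalizing constant, so differences between log-probabilities are scaled by $\alpha$ at every step. That is why the gap contracts for $\alpha < 1$ and expands for $\alpha > 1$ in this sketch.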

Figures (13)

  • Figure 1: Empirical validation of the $\alpha$-law. Each subplot shows $\log q_1$ vs. $\log q_0 + \log b$ for one model$\times$dataset combination. The fitted slope $\alpha \approx 1.16$ is consistent with near-Bayesian updating. The dashed line shows ideal Bayesian updating ($\alpha = 1$).
  • Figure 2: Cross-model and cross-dataset consistency. (a) Both GPT-5.2 and Claude Sonnet 4 exhibit consistent $\alpha \approx 1.1$--$1.2$, supporting architecture independence of the $\alpha$-law. (b) The same scaling behavior persists across diverse datasets and reasoning domains, indicating cross-task generality.
  • Figure 3: Cross-vendor comparison. GPT-5.2 and Claude Sonnet 4 exhibit similar $\alpha$ distributions, suggesting architecture-independent behavior under the $\alpha$-law. Both models cluster near the Bayesian optimum ($\alpha = 1$).
  • Figure 4: Fallback contamination rates by model$\times$dataset. Green bars ($<$10%) indicate clean data suitable for $\alpha$-law analysis. Red bars show Gemini 2.5's critical contamination, justifying its exclusion from primary analysis.
  • Figure 5: Multi-step $\alpha$ trajectory. $\alpha$ decreases from 0.84 to 0.54 over 7 revision steps, entering the contractive regime and ensuring convergence. Linear decay: slope $= -0.040$, $R^2 = 0.735$, $p = 0.014$.
  • ...and 8 more figures
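
The slope fit described in the Figure 1 caption can be sketched as an ordinary least-squares regression of $\log q_1$ on $\log q_0 + \log b$. The sketch below is not the authors' pipeline. It uses synthetic, noise-free data generated from an arbitrary exponent, treats the normalizer as a single constant intercept (a simplification), and the names `fit_alpha`, `alpha_true`, and `scale_true` are hypothetical.

```python
import math
import random

def fit_alpha(q0, b, q1):
    """OLS slope and intercept of log q1 regressed on (log q0 + log b)."""
    x = [math.log(p) + math.log(e) for p, e in zip(q0, b)]
    y = [math.log(p) for p in q1]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data only: these values are illustrative, not results from the paper.
rng = random.Random(0)
alpha_true = 0.9   # arbitrary exponent used to generate the data
scale_true = 0.8   # arbitrary constant standing in for the normalizer
q0 = [rng.uniform(0.05, 0.95) for _ in range(200)]  # prior probabilities
b = [rng.uniform(0.1, 1.0) for _ in range(200)]     # evidence terms
q1 = [scale_true * (p * e) ** alpha_true for p, e in zip(q0, b)]

slope, intercept = fit_alpha(q0, b, q1)
print(slope, intercept)  # slope recovers alpha_true on this noise-free data
```

On real elicited probabilities the points would scatter around the line, and the fitted slope would be an estimate of $\alpha$ rather than an exact recovery.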

Theorems & Definitions (2)

  • Theorem 1: $\alpha$-stability
  • proof