Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Xinghao Zhao

Abstract

Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps, captured by sampling a few answer completions per step, predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive ($\rho = -0.06$, p=0.31), revealing a shape-over-magnitude dissociation: what matters is whether entropy decreases at every step, not how much it decreases. Accuracy falls with the number of monotonicity violations: 68.8%, 50.8%, and 28.6% for 0, 1, and 2 violations, respectively. Token log-probability confidence becomes less well calibrated with step depth (ECE: 0.186 → 0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens per question, which is 1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
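The monotonicity criterion described in the abstract can be illustrated with a minimal sketch. It assumes that the per-step entropy is the Shannon entropy (in nats) of the empirical distribution over a few sampled answers, and that a tie between consecutive steps counts as a violation; the paper's exact definition of $H_k$, the number of samples per step, and the tie-handling rule are not given in this section, so these choices and all function names are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def violation_count(entropies, strict=True):
    """Count steps at which entropy fails to decrease.

    strict=True treats a tie (no change) as a violation; strict=False
    counts only increases. Which rule the paper uses is an assumption here.
    """
    pairs = zip(entropies, entropies[1:])
    if strict:
        return sum(1 for prev, cur in pairs if cur >= prev)
    return sum(1 for prev, cur in pairs if cur > prev)


def is_monotone(entropies, strict=True):
    """A chain is monotone when it has zero violations."""
    return violation_count(entropies, strict) == 0


# Hypothetical answers sampled after each of three reasoning steps.
steps = [
    ["12", "15", "18", "12"],  # answers still spread out
    ["12", "12", "18", "12"],  # converging
    ["12", "12", "12", "12"],  # full agreement
]
trajectory = [answer_entropy(s) for s in steps]
print(trajectory, is_monotone(trajectory))
```

With only a handful of samples per step, ties in entropy are common, so the choice between the strict and non-strict rule can change which chains are labelled monotone.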

Paper Structure (70 sections, 4 equations, 6 figures, 15 tables)

This paper contains 70 sections, 4 equations, 6 figures, 15 tables.

Introduction
Main finding.
Contributions.
Setup and Methods
Model and dataset.
Chain-of-thought generation.
Step-level entropy measurement.
Scalar coherence.
Binary monotonicity.
Step-level calibration.
Selective prediction.
Step-Level Calibration
ECE increases monotonically with step depth.
Token log-prob proxies share ranking but compress confidence range.
Answer-distribution confidence.
...and 55 more sections

Figures (6)

Figure 1: Two key findings from our diagnostic study. (a) Token log-probability confidence becomes increasingly miscalibrated at later reasoning steps. (b) Chains with monotone entropy trajectories achieve substantially higher accuracy than non-monotone chains.
Figure 2: Example per-step answer-distribution entropy trajectories. Left: A monotone trajectory (each step reduces $H_k$), corresponding to a correct final answer. Right: A non-monotone trajectory with a mid-chain entropy spike, corresponding to an incorrect answer. $H_k$ is defined in \ref{eq:entropy}.
Figure 3: Accuracy vs. coverage for five cheap reliability signals. Entropy-trajectory monotonicity (solid blue) dominates all scalar baselines. Scalar coherence (dotted, AURC $=0.408$) is below the random baseline. Dashed line: full-coverage accuracy ($63.0\%$).
Figure 4: Monotone $-$ non-monotone accuracy gap across three additional seeds on GSM8K ($n{=}300$ per seed). The gap remains positive for all seeds.
Figure 5: Coverage-aware answered-set accuracy on GSM8K ($n{=}300$). SC confidence is vote-agreement fraction for SC@3/SC@5; our curve uses monotonicity-first ranking. Dotted line marks 73.7% coverage.
...and 1 more figures
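The calibration trend in Figure 1(a) is reported as expected calibration error (ECE). A minimal sketch of the standard equal-width binned ECE follows; the number of bins and the binning scheme used in the paper are not stated in this section, so the 10-bin default and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Returns the sample-weighted mean of |mean confidence - accuracy| over bins.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical example: four predictions at 0.9 confidence, three correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Computing this separately for the confidences recorded at each reasoning step would give a per-step ECE curve of the kind the caption describes.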
