Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning

Xinhai Sun

TL;DR

Reinforcement Inference introduces an inference-time, uncertainty-driven second-pass mechanism that prompts a more deliberate reasoning attempt only when the model signals hesitation. Using entropy $H(P)$ and maximum softmax probability $MSP(P)$ to trigger the second pass, the approach raises accuracy on 12,032 MMLU-Pro questions with DeepSeek-v3.2 from $60.72\%$ to $84.03\%$ at a moderate compute cost ($\approx 61\%$ more inference calls). Ablations show that the gains depend on the uncertainty cue rather than on generic prompt enhancements, and that the method transfers across model families with varying efficiency profiles, though thresholds may require calibration per architecture. The results advocate an entropy-aware paradigm for measuring and expanding model capability, enabling self-correcting behavior without retraining and informing future training and decoding objectives that align confidence with correctness.
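The two uncertainty signals named above have standard definitions: Shannon entropy $H(P) = -\sum_i p_i \ln p_i$ (in nats) and $MSP(P) = \max_i p_i$ over the model's answer distribution $P$. The sketch below shows one way a trigger could combine them. The function names, the default thresholds `tau_h` and `tau_msp`, and the choice to re-ask when either signal fires are illustrative assumptions, not the paper's confirmed rule.

```python
import math
from typing import Sequence


def entropy_nats(probs: Sequence[float]) -> float:
    """Shannon entropy H(P) in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def msp(probs: Sequence[float]) -> float:
    """Maximum softmax probability: the mass on the top-ranked option."""
    return max(probs)


def should_reask(
    probs: Sequence[float],
    tau_h: float = 0.5,    # hypothetical entropy threshold (nats)
    tau_msp: float = 0.8,  # hypothetical MSP threshold
) -> bool:
    """Assumed rule: request a second pass if entropy is high OR MSP is low."""
    return entropy_nats(probs) > tau_h or msp(probs) < tau_msp


confident = [0.97, 0.01, 0.01, 0.01]  # peaked distribution: no second pass
hesitant = [0.40, 0.35, 0.15, 0.10]   # flat distribution: second pass
print(should_reask(confident), should_reask(hesitant))
```

Under this assumed rule only the hesitant distribution triggers a second pass, which is what keeps the number of extra inference calls below a full re-ask of every question.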

Abstract

Modern large language models (LLMs) are often evaluated and deployed under a \emph{one-shot, greedy} inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically underestimate a fixed model's true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce \emph{Reinforcement Inference}, an entropy-aware inference-time control strategy that uses the model's own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance \emph{without any retraining}. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72\% to 84.03\%, while incurring only 61.06\% additional inference calls. A 100\% re-asking ablation reaches 84.35\%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a \emph{prompt-only} ablation underperforms the baseline, suggesting that the gains are not explained by generic ``your output had high entropy, think step-by-step'' prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader \emph{entropy-aware} paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM's latent reasoning horizon and motivates future training objectives that explicitly constrain correctness--confidence alignment.
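
As a rough check on the compute claim, the reported percentages can be converted into call counts. This back-of-the-envelope calculation assumes that the baseline makes exactly one inference call per question and that the 61.06% overhead is measured relative to that baseline; the counts are rounded estimates derived from the abstract's figures, not numbers quoted from the paper.

$$
\begin{aligned}
\text{re-asked questions} &\approx 0.6106 \times 12{,}032 \approx 7{,}347 \\
\text{total calls (selective)} &\approx 12{,}032 + 7{,}347 = 19{,}379 \\
\text{total calls (100\% re-ask)} &= 2 \times 12{,}032 = 24{,}064 \\
\text{gain (selective)} &= 84.03 - 60.72 = 23.31 \text{ points} \\
\text{gain (100\% re-ask)} &= 84.35 - 60.72 = 23.63 \text{ points} \\
\text{share of attainable gain} &= 23.31 / 23.63 \approx 98.6\%
\end{aligned}
$$

Under these assumptions, selective re-asking recovers roughly 98.6% of the improvement of the full re-ask ablation while issuing about 19% fewer total calls (19,379 versus 24,064).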

Paper Structure (30 sections, 7 equations, 3 figures, 10 tables)


Figures (3)

  • Figure 1: System structure of Reinforcement Inference.
  • Figure 2: First-round distributions of entropy (nats) and MSP over the 12,032 MMLU-Pro questions, illustrating the separation between correct and incorrect answers.
  • Figure 3: Accuracy--compute trade-off from the 16-run sweep. Each point is one threshold pair $(\tau_H,\tau_{\mathrm{MSP}})$, plotted by its re-ask rate (x-axis) and final overall accuracy (y-axis).
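
The kind of point plotted in Figure 3 can be sketched as follows: for one threshold pair $(\tau_H,\tau_{\mathrm{MSP}})$, count how many questions would be re-asked and what the final accuracy would be. This is a minimal illustration only. The `Record` type, the OR-style trigger, the 4x4 grid used to obtain 16 runs, and the four toy records are all assumptions made for the example and are not data or code from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Record:
    entropy: float        # first-pass entropy in nats
    msp: float            # first-pass maximum softmax probability
    first_correct: bool   # was the first-pass answer correct?
    second_correct: bool  # is the second-pass answer correct (if re-asked)?


def sweep_point(
    records: Sequence[Record], tau_h: float, tau_msp: float
) -> Tuple[float, float]:
    """Return (re-ask rate, final accuracy) for one threshold pair."""
    # Assumed trigger: re-ask when entropy is high OR MSP is low.
    reasked = [r.entropy > tau_h or r.msp < tau_msp for r in records]
    # The final answer is the second pass when re-asked, else the first pass.
    correct = [
        r.second_correct if again else r.first_correct
        for r, again in zip(records, reasked)
    ]
    n = len(records)
    return sum(reasked) / n, sum(correct) / n


def sweep(
    records: Sequence[Record],
    tau_hs: Sequence[float],
    tau_msps: Sequence[float],
) -> Dict[Tuple[float, float], Tuple[float, float]]:
    """Evaluate every threshold pair on the grid."""
    return {
        (th, tm): sweep_point(records, th, tm)
        for th, tm in product(tau_hs, tau_msps)
    }


# Synthetic toy records, for illustration only.
records = [
    Record(0.05, 0.99, True, True),
    Record(0.90, 0.55, False, True),
    Record(1.30, 0.35, False, False),
    Record(0.40, 0.85, True, True),
]
grid = sweep(records, [0.25, 0.5, 0.75, 1.0], [0.6, 0.7, 0.8, 0.9])
print(len(grid), sweep_point(records, 0.5, 0.8))
```

Each grid entry maps a threshold pair to one (x, y) point of an accuracy-compute plot, so looser thresholds move a point to the right (more re-asks) and, when second passes help, upward.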