In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

Mingye Zhu, Yi Liu, Zheren Fu, Quan Wang, Yongdong Zhang

TL;DR

InTRO introduces token-level exploration with self-generated feedback by aligning an LLM’s forward policy with an answer-conditioned posterior through KL divergence, providing dense, per-token guidance without external supervision. By approximating the intractable objective with an estimated posterior conditioned on the correct answer and token-level importance weights, InTRO achieves accurate and concise chain-of-thought reasoning while maintaining computational efficiency. Empirically, it yields consistent improvements in math reasoning benchmarks and exhibits robust cross-domain generalization to non-mathematical tasks, with shorter rationales that preserve or enhance correctness. The work thus offers a principled, scalable alternative to coarse RL or single-solution fine-tuning for improving reasoning in LLMs.

Abstract

Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization because it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, i.e., token-wise importance weights estimated from the information discrepancy between the generative policy and its answer-conditioned counterpart, to select informative next tokens. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

Paper Structure

This paper contains 33 sections, 1 theorem, 19 equations, 15 figures, 4 tables, 1 algorithm.

Key Result

Proposition 2.1

Under the assumption that $y = f(z)$ is a deterministic function of $z$, the gradient of the marginal log-likelihood objective (Eq. (eq:marginal)) is identical to the gradient derived from minimizing the KL-divergence objective (Eq. (eq:kl_objective)).
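
The two referenced equations are not reproduced on this page, so the following is only a sketch of why such an identity can hold, under stated assumptions rather than the paper's exact formulation. It assumes the marginal is $p_\theta(y \mid x) = \sum_z \pi_\theta(z \mid x)\,\mathbb{1}[f(z) = y]$, that the KL objective is $\mathrm{KL}(q \,\|\, \pi_\theta(\cdot \mid x))$ with $q(z) = \pi_\theta(z \mid x, y)$ held fixed (no gradient flows through $q$), and that "identical" is read up to the sign convention of maximizing the likelihood versus minimizing the divergence.

$$
\begin{aligned}
\nabla_\theta \log p_\theta(y \mid x)
  &= \frac{\sum_z \mathbb{1}[f(z) = y]\, \nabla_\theta \pi_\theta(z \mid x)}{p_\theta(y \mid x)} \\
  &= \sum_z \frac{\pi_\theta(z \mid x)\, \mathbb{1}[f(z) = y]}{p_\theta(y \mid x)}\, \nabla_\theta \log \pi_\theta(z \mid x) \\
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x, y)}\big[\nabla_\theta \log \pi_\theta(z \mid x)\big], \\[4pt]
\nabla_\theta\, \mathrm{KL}\big(q \,\|\, \pi_\theta(\cdot \mid x)\big)
  &= -\,\mathbb{E}_{z \sim q}\big[\nabla_\theta \log \pi_\theta(z \mid x)\big]
  \quad \text{for fixed } q .
\end{aligned}
$$

The second line uses $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$, and the third recognizes the answer-conditioned posterior via Bayes' rule. With $q$ set to that posterior, the two gradients agree up to sign, so a step that increases the marginal log-likelihood is a step that decreases the KL divergence.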

Figures (15)

  • Figure 1: When solving a reasoning task, the initial model gives a rationale (green) that differs from its answer-conditioned counterpart (blue). InTRO leverages this information discrepancy to compute correction factors during training, yielding an updated model that produces concise and accurate rationales (orange).
  • Figure 2: The illustration of the InTRO framework. Top. The policy $\pi_\theta$ generates reasoning paths for query $x$ and only paths that yield the correct answer are retained. Middle. For each retained prefix $z_{<t}$ we (i) sample $n$ next tokens $z_{t}^i$ from the forward policy (green) and (ii) obtain the corresponding token probabilities from the estimated posterior by conditioning on the concatenated input $x \oplus y$ (blue). The ratio of these probabilities gives the token-level correction factor $w_{t}^i$ (orange) for $z_t^{i}$. Bottom. At every position, gradients are aggregated according to $w_{t}^{i}$, providing token-level feedback that guides training.
  • Figure 3: Avg. response length on test questions. InTRO provides remarkably shorter rationales, especially on more challenging problems and stronger base models (Qwen3).
  • Figure 4: Generated response length during training (on training and test questions, respectively).
  • Figure 5: Test accuracy on "Hard" set during training.
  • ...and 10 more figures
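
The token-level step described in the Figure 2 caption can be illustrated with a small NumPy sketch. It is a toy illustration under assumptions, not the authors' implementation: the function names `correction_factors` and `weighted_token_loss` are hypothetical, the logits stand in for model outputs at a single position, and the self-normalization of the weights in the loss is an assumed aggregation rule that the page does not specify.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def correction_factors(forward_logits, posterior_logits, n, rng):
    """Sample n next tokens from the forward policy and weight each one.

    forward_logits:   logits of pi_theta(. | x, z_<t)
    posterior_logits: logits of the same model when the input is x
                      concatenated with the correct answer y
    Returns the sampled token ids and the ratios w_i = q(z_i) / pi(z_i).
    """
    pi = softmax(forward_logits)
    q = softmax(posterior_logits)
    tokens = rng.choice(len(pi), size=n, p=pi)  # token-level exploration
    weights = q[tokens] / pi[tokens]            # correction factors
    return tokens, weights


def weighted_token_loss(forward_logits, tokens, weights):
    """Weighted negative log-likelihood of the sampled tokens.

    Assumption: weights are self-normalized to sum to one before use.
    """
    log_pi = np.log(softmax(forward_logits))
    norm_w = weights / weights.sum()
    return float(-(norm_w * log_pi[tokens]).sum())
```

Calling `correction_factors(forward_logits, posterior_logits, n=8, rng=np.random.default_rng(0))` returns eight sampled token ids with their ratios. A ratio above one marks a token that the answer-conditioned distribution favors more than the forward policy does, and when the two distributions coincide every ratio equals one.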

Theorems & Definitions (1)

  • Proposition 2.1