Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki

TL;DR

This work introduces a tractable Tree-structured Multi-task Markov Chain (TMC) framework to model how post-training strategies like RLVR and inference-based reward shaping affect reasoning traces across multiple tasks. The authors show that both RLVR finetuning and population-based inference-scaling exhibit a simplicity bias, favoring easy, common CoTs and risking the forgetting of rare but crucial hard-CoT traces needed for difficult instances. They formalize antidotes such as rejecting easy instances and KL-regularization, and propose DPRM (Doob’s h-transform)–based sampling to preserve breadth of CoTs, including hard ones, while maintaining cross-task capabilities. Empirical simulations corroborate the theory, indicating that exploration within the existing CoT tree can preserve rare yet critical reasoning paths, guiding the design of more robust post-training and inference strategies. The findings imply that carefully designed exploration and reward schemes are essential to maintain and leverage diverse reasoning patterns in multi-task settings, with potential impact on how we fine-tune and decode large language models for hard problems.
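The squeezing effect and the KL-regularization remedy summarized above can be illustrated with a toy three-outcome softmax policy. This is a minimal sketch under stated assumptions, not the paper's TMC construction: the three outcomes (an easy correct CoT, a hard correct CoT, an incorrect CoT), the base probabilities, the learning rate, and the function name `train` are all hypothetical. The update is the exact expected policy gradient for the objective E[r] - beta * KL(p || base).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(base_probs, rewards, beta=0.0, lr=0.5, steps=2000):
    """Expected policy-gradient ascent on E[r] - beta * KL(p || base)."""
    logits = [math.log(p) for p in base_probs]
    for _ in range(steps):
        probs = softmax(logits)
        # The KL term acts like a per-outcome reward penalty beta * log(p / base).
        shaped = [r - beta * math.log(p / b)
                  for r, p, b in zip(rewards, probs, base_probs)]
        baseline = sum(p * s for p, s in zip(probs, shaped))
        # Softmax policy gradient: d/d(logit_i) = p_i * (shaped_i - baseline).
        logits = [l + lr * p * (s - baseline)
                  for l, p, s in zip(logits, probs, shaped)]
    return softmax(logits)

# Hypothetical base model: easy-correct, hard-correct, incorrect.
BASE = [0.6, 0.1, 0.3]
REWARD = [1.0, 1.0, 0.0]  # both correct CoTs earn the same verifiable reward

plain = train(BASE, REWARD, beta=0.0)
with_kl = train(BASE, REWARD, beta=0.5)

print("base hard/easy ratio:   ", BASE[1] / BASE[0])
print("no KL hard/easy ratio:  ", plain[1] / plain[0])
print("with KL hard/easy ratio:", with_kl[1] / with_kl[0])
```

In this toy setting, the unregularized update moves probability mass toward the easy CoT even though the hard CoT earns the same reward, because each logit's gradient is scaled by its current probability. With the KL term the stationary policy is proportional to base * exp(r / beta), so the hard-to-easy ratio of the base model is retained.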

Abstract

Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RLVR and inference scaling with outcome or process reward models (ORM/PRM). While recent work highlights the role of exploration and entropy stability in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing tree-like reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps at all if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as high- versus low-probability Markov transitions, and formalize post-training dynamics through Multi-task Tree-structured Markov Chains (TMC). In this tractable model, pretraining corresponds to tree expansion, while post-training corresponds to chain-of-thought reweighting. We show that several phenomena recently observed in empirical studies arise naturally in this setting: (1) RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths; (2) population rewards of ORM/PRM encourage consistency rather than accuracy, thereby favoring common patterns; and (3) certain rare, high-uncertainty reasoning paths of the base model are responsible for solving hard problem instances. Together, these explain why exploration -- even when confined to the base model's reasoning scope -- remains essential: it preserves access to rare but crucial reasoning traces needed for difficult cases, which are squeezed out by RLVR or disfavored by inference scaling. Building on this, we further show that exploration strategies such as rejecting easy instances and KL regularization help preserve rare reasoning traces. Empirical simulations corroborate our theoretical results.
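The gap between consistency and accuracy can be made concrete with a small calculation. The sketch below assumes a hard instance with only two candidate answers, where the base model samples a correct trace with probability p < 1/2, and uses majority voting as a simple stand-in for a population-style selection rule. Both the binary-answer setting and the use of majority voting are illustrative assumptions, not the paper's ORM/PRM analysis.

```python
from math import comb

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

def majority_correct(p: float, k: int) -> float:
    """Probability that the correct answer wins a majority vote over k samples.

    Assumes two candidate answers and odd k, so ties cannot occur.
    """
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j)
               for j in range(k // 2 + 1, k + 1))

p_rare = 0.3   # hypothetical per-sample probability of the rare, correct trace
k = 31

print("single sample:", p_rare)
print("pass@k:       ", pass_at_k(p_rare, k))
print("majority vote:", majority_correct(p_rare, k))
```

When the correct trace is the minority pattern, drawing more samples raises pass@K but lowers the accuracy of the vote, which is the sense in which a consistency-seeking rule can work against rare but correct reasoning paths.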

Paper Structure

This paper contains 25 sections, 27 theorems, 193 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $X_0\sim\operatorname{Unif}(S\setminus S_{L})$ and $X_1\sim \mathbb{P}(\cdot|X_0)$ be random samples from the TMC $X$ in Def. 1, the softmax predictor trained by cross-entropy $L_{\text{CE}}= \mathbb{E}_{X_0,X_1}[\log\hat{p}_{{\boldsymbol{\theta}}^{(t-1)}}(X_1|X_0)]$ via Alg. 1 achie
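The statement above is cut off, so its conclusion is not reproduced here. The setup it describes, a softmax next-state predictor fit to chain transitions by cross-entropy, can still be sketched on a tiny tree. Everything specific below is a hypothetical illustration: the state names, the transition probabilities, the tabular logit parameterization, and the use of exact population gradients of the negative log-likelihood (the usual sign convention for a loss to be minimized).

```python
import math

# Hypothetical tree: state -> {child: true transition probability}.
TRUE_P = {
    "q": {"o_easy": 0.9, "o_hard": 0.1},   # high- vs low-probability step
    "o_easy": {"a_1": 0.7, "a_2": 0.3},
    "o_hard": {"a_2": 1.0},
}

def softmax(logits: dict) -> dict:
    m = max(logits.values())
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def fit(true_p, lr=1.0, steps=500):
    """Gradient descent on the expected negative log-likelihood.

    For tabular logits, the gradient at state s with respect to the logit
    of child c is predicted(c | s) - true(c | s).
    """
    logits = {s: {c: 0.0 for c in kids} for s, kids in true_p.items()}
    for _ in range(steps):
        for s, kids in true_p.items():
            pred = softmax(logits[s])
            for c in kids:
                logits[s][c] -= lr * (pred[c] - kids[c])
    return {s: softmax(l) for s, l in logits.items()}

fitted = fit(TRUE_P)
for state, kids in fitted.items():
    print(state, {c: round(p, 4) for c, p in kids.items()})
```

In this tabular case the minimizer of the cross-entropy is the true conditional distribution, so the fitted predictor keeps the low-probability transition at its original small weight rather than discarding it.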

Figures (5)

  • Figure 1: Left: abstraction of a 3-layer TMC. Nodes are states grouped into layers $S_1$–$S_3$; solid arrows denote high-probability (confident) transitions and dashed arrows denote low-probability (unsure) transitions. A task is a tuple $(\mathbf{q},\mathbf{a},\mathbf{k})$, where $\mathbf{q}\in\{q,q'\}$ is the question state, $\mathbf{a}\in\{a_1,a_2,a_3\}$ is the answer state, and $\mathbf{k}\in[5]$. Right: a concrete illustration of a 5-task, 3-layer Multi-task TMC. $x,y$ in $q$ represent real numbers (with decimals), whereas $A,B$ in $q'$ represent integers. We use two instances, $3.9 > 3.11?$ for $q$ and $3>3?$ for $q'$, to describe the five tasks: (T1) decimal version ordering ($3.9 < 3.11$); (T2) real-number comparison ($3.9 > 3.11$); (T3) integer equality ($3 - 3 = 0$); (T4) integer-part version ordering (e.g., $3.9 = 3.11$); and (T5) integer-part real-number comparison ($3.9 = 3.11$). Tasks 1–3 are common and each admits $\ge 1$ easy-to-reason CoT, while Tasks 4–5 are rare and admit only hard-to-reason CoTs. For Task 2, there are two valid CoTs: $q \rightarrow o_2^2 \rightarrow a_2$ (where $o_2^2$ merely compares the numbers left to right) and $q_2 \rightarrow o_2^3 \rightarrow a_2$ (where $o_2^3$ performs the arithmetic calculation). Both CoTs are correct for the instance $3.9 > 3.11?$. However, for hard question instances such as $0.8+3.1 > 2.11+1.0?$, only the hard-to-reason CoT $q_2 \rightarrow o_2^3 \rightarrow a_2$ is correct, since the instance requires explicit calculation; left-to-right token comparison alone does not suffice.
  • Figure 2: Pass@30 Performance for TASK1 across different sampling strategies. The results show relatively consistent performance across most methods, with DPRM achieving the highest rate of 0.73.
  • Figure 3: Pass@30 Performance for TASK2 across different sampling strategies. The results demonstrate significant performance degradation for standard RL methods (REINFORCE, RAFT, PPO) compared to diversity-promoting approaches, confirming the forgetting phenomenon.
  • Figure 4: Valid CoT Coverage for TASK1 (K=30, N=15, 200 trials). The stacked bars show the proportion of invalid (gray), hard valid (red), and easy valid (green) CoTs generated by each strategy. Standard RL methods show strong simplicity bias with predominantly easy valid CoTs.
  • Figure 5: Valid CoT Coverage for TASK2 (K=30, N=15, 200 trials). The stacked bars show the proportion of invalid (gray), hard valid (red), and easy valid (green) CoTs generated by each strategy. Diversity-promoting methods achieve significantly better coverage compared to standard RL approaches.
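The Task 1 and Task 2 examples from the Figure 1 caption can be written out directly. In the sketch below, `easy_cot_greater` is one hypothetical reading of "left-to-right comparison" (it looks only at the leading number on each side), and `hard_cot_greater` evaluates both sides exactly before comparing. The function names and the `+`-only expression format are assumptions made for illustration.

```python
from fractions import Fraction

def version_less(a: str, b: str) -> bool:
    """T1-style version ordering: compare dot-separated fields as integers."""
    return [int(x) for x in a.split(".")] < [int(x) for x in b.split(".")]

def value(expr: str) -> Fraction:
    """Exact value of a '+'-separated sum of decimal literals."""
    return sum((Fraction(term) for term in expr.split("+")), Fraction(0))

def easy_cot_greater(lhs: str, rhs: str) -> bool:
    """Shortcut CoT: compare only the leading number on each side."""
    return Fraction(lhs.split("+")[0]) > Fraction(rhs.split("+")[0])

def hard_cot_greater(lhs: str, rhs: str) -> bool:
    """Calculation CoT: evaluate both sides, then compare."""
    return value(lhs) > value(rhs)

# T1 vs T2 on the same surface form.
print("version 3.9 < 3.11:", version_less("3.9", "3.11"))
print("real    3.9 > 3.11:", hard_cot_greater("3.9", "3.11"))

# Easy instance: both CoTs agree. Hard instance: only the calculation is right.
for lhs, rhs in [("3.9", "3.11"), ("0.8+3.1", "2.11+1.0")]:
    print(lhs, ">", rhs, "| easy CoT:", easy_cot_greater(lhs, rhs),
          "| hard CoT:", hard_cot_greater(lhs, rhs))
```

On the easy instance the shortcut and the calculation return the same answer, so a reward on the final answer cannot tell them apart. On the hard instance only the calculation path is correct, which is why losing it matters for difficult cases.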

Theorems & Definitions (38)

  • Definition 1: Tree-structured Markov Chains (TMC).
  • Definition 2: Multi-task Capability in TMC. (Informal)
  • Theorem 1: Informal Version of Thm. thm:prefull
  • Theorem 2: Squeezing Effect of RL-finetuning
  • Proposition 1: Advantage Gap between Easy and Hard CoT
  • Corollary 1: RL-rej Enables Hard-CoT Learning
  • Lemma 1: Optimal Sampling of GRPO Variants
  • Corollary 2: KL-regularization Enables Hard-CoT learning and Maintain Cross-task Capability
  • Theorem 3: Failure of Inference-Scaling with ORM/PRM
  • Proposition 2: Population Rewards Favor Easy CoTs
  • ...and 28 more