Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning
Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
TL;DR
This work introduces a tractable Multi-task Tree-structured Markov Chain (TMC) framework that models how post-training strategies such as RLVR and inference-based reward shaping reweight reasoning traces across multiple tasks. The authors show that both RLVR finetuning and population-based inference scaling exhibit a simplicity bias: they favor easy, common CoTs and risk forgetting the rare, hard CoTs needed for difficult instances. They formalize antidotes such as rejecting easy instances and KL regularization, and propose sampling based on DPRM (Doob's h-transform) to preserve the breadth of CoTs, including hard ones, while maintaining cross-task capabilities. Empirical simulations corroborate the theory, indicating that exploration within the existing CoT tree can preserve rare yet critical reasoning paths. The findings imply that exploration and reward schemes must be designed to maintain diverse reasoning patterns in multi-task settings, which bears on how large language models are fine-tuned and decoded for hard problems.
Abstract
Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RLVR and inference scaling with outcome or process reward models (ORM/PRM). While recent work highlights the role of exploration and entropy stability in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing tree-like reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps at all if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as high- versus low-probability Markov transitions, and formalize post-training dynamics through Multi-task Tree-structured Markov Chains (TMC). In this tractable model, pretraining corresponds to tree expansion, while post-training corresponds to chain-of-thought reweighting. We show that several phenomena recently observed in empirical studies arise naturally in this setting: (1) RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths; (2) population rewards of ORM/PRM encourage consistency rather than accuracy, thereby favoring common patterns; and (3) certain rare, high-uncertainty reasoning paths of the base model are responsible for solving hard problem instances. Together, these explain why exploration, even when confined to the base model's reasoning scope, remains essential: it preserves access to rare but crucial reasoning traces needed for difficult cases, which are squeezed out by RLVR or disfavored by inference scaling. Building on this, we further show that exploration strategies such as rejecting easy instances and KL regularization help preserve rare reasoning traces. Empirical simulations corroborate our theoretical results.
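The squeezing and consistency effects described above can be illustrated with a toy two-path calculation. This is a minimal sketch and not the paper's TMC model: it assumes a single policy shared across instances, a base model that emits an easy CoT with probability 0.9 and a hard CoT with probability 0.1, and hypothetical average rewards of 0.8 and 0.2 (the fraction of instances each path solves). The names `kl_regularized_policy`, `pass_at_k`, and `majority_vote_accuracy` are illustrative. The first function uses the standard closed form for KL-regularized reward maximization, pi(c) proportional to base(c) * exp(r(c) / beta), so a small beta stands in for weakly regularized RLVR.

```python
import math

def kl_regularized_policy(base, reward, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || base)."""
    weights = {c: p * math.exp(reward[c] / beta) for c, p in base.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def pass_at_k(p_correct, k):
    """Probability that at least one of k i.i.d. samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k

def majority_vote_accuracy(p_correct, n):
    """Probability that a strict majority of n i.i.d. samples is correct,
    assuming only two candidate answers (one correct, one wrong)."""
    return sum(
        math.comb(n, j) * p_correct**j * (1.0 - p_correct) ** (n - j)
        for j in range(n // 2 + 1, n + 1)
    )

# Hypothetical numbers, chosen only for illustration.
base = {"easy_cot": 0.9, "hard_cot": 0.1}    # base-model path probabilities
reward = {"easy_cot": 0.8, "hard_cot": 0.2}  # assumed share of instances solved

# (1) Squeezing: weaker KL regularization (smaller beta) drains the hard path,
# which lowers pass@K on instances that only the hard path solves.
for beta in (1.0, 0.1, 0.02):
    pi = kl_regularized_policy(base, reward, beta)
    print(f"beta={beta}: P(hard_cot)={pi['hard_cot']:.3e}, "
          f"hard-instance pass@16={pass_at_k(pi['hard_cot'], 16):.3e}")

# (2) Consistency vs. accuracy: on a hard instance sampled from the base model,
# majority voting gets worse with more samples while pass@N keeps improving.
for n in (1, 5, 25):
    print(f"N={n}: majority-vote accuracy={majority_vote_accuracy(0.1, n):.3e}, "
          f"pass@N={pass_at_k(0.1, n):.3f}")
```

Under these assumptions, lowering `beta` concentrates mass on the common path, and a larger sample budget helps pass@N but hurts majority voting on the hard instance. This mirrors the abstract's claims (1) and (2) only qualitatively.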
