Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang

TL;DR

This work introduces Native Reasoning Training (NRT), a verifier-free framework that treats reasoning traces as latent variables and trains them using intrinsic rewards derived from question–answer pairs. By formalizing reasoning as a latent optimization problem and employing diverse aggregation-based reward schemes, including robust strategies like geometric mean and weighted sums, NRT mitigates policy collapse and fosters long, high-quality reasoning traces. Empirical results across multiple model families and nine reasoning benchmarks demonstrate state-of-the-art performance among verifier-free methods, with particularly large gains in complex multi-step tasks and strong robustness to training instabilities. The approach broadens the applicability of powerful reasoning in unverifiable domains and suggests a scalable path toward more capable reasoning systems that do not rely on external verifiers or expert demonstrations.

Abstract

The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop where the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
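The reward idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption: the function names (`aggregate_reward`, `surprisal_weights`, `reasoning_gain`), the use of a no-reasoning baseline, and the choice of $-\log p$ weights (suggested only by the "NRT-WS $(-\log p)$" label in Figure 4) are illustrative, not the authors' implementation. The inputs are the per-token probabilities the model assigns to the reference answer $y^\star$, with and without a sampled reasoning trace z.

```python
import math

def aggregate_reward(token_probs, scheme="geometric_mean", weights=None):
    """Collapse per-token probabilities of the reference answer into one scalar.

    Hypothetical helper: the scheme names mirror the aggregation families
    named in the TL;DR, but the exact definitions are assumptions.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    if scheme == "arithmetic_mean":
        return sum(token_probs) / len(token_probs)
    if scheme == "geometric_mean":
        # exp(mean log p): one near-zero token pulls the whole reward down,
        # so the trace cannot score well by helping only the easy tokens.
        return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))
    if scheme == "weighted_sum":
        if weights is None or len(weights) != len(token_probs):
            raise ValueError("weighted_sum needs one weight per token")
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return sum(w * p for w, p in zip(weights, token_probs)) / total
    raise ValueError(f"unknown scheme: {scheme}")

def surprisal_weights(baseline_probs):
    """Assumed weighting: -log p under a baseline, so hard tokens count more."""
    return [-math.log(p) for p in baseline_probs]

def reasoning_gain(probs_with_z, probs_without_z, scheme="geometric_mean"):
    """Assumed intrinsic reward: how much a trace z raises confidence in y*."""
    weights = surprisal_weights(probs_without_z) if scheme == "weighted_sum" else None
    with_z = aggregate_reward(probs_with_z, scheme, weights)
    without_z = aggregate_reward(probs_without_z, scheme, weights)
    return with_z - without_z
```

Under these assumptions, a trace that raises the probability of a token the baseline found hard earns more weighted-sum reward than one that only sharpens tokens that were already easy.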

Paper Structure

This paper contains 38 sections, 43 equations, 4 figures, 10 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Comparison of Reinforcement Learning with Verifiable Rewards (RLVR) and our Native Reasoning Training (NRT). (Top) RLVR uses an external verifier to reward reasoning z that yields an answer y matching the ground-truth $y^\star$. This approach is constrained by its need for a verifier. (Bottom) NRT operates on general SFT data, using only a question x and a reference answer $y^\star$. It trains the model to generate a latent reasoning trace z by intrinsically rewarding traces that increase its own predictive confidence in the reference answer. This self-reinforcing process removes the need for external verifiers or expert-written reasoning.
  • Figure 2: Evolution of the reasoning process during RL fine-tuning across three dimensions: diversity (entropy; a, b), length (c, d), and semantic quality (e, f). Quality is measured by an LLM-as-a-judge using a 0-1 score (see Appendix \ref{app:llm_judge_details}). While the RLPR baseline suffers a rapid collapse across all metrics, producing short, repetitive, and low-quality reasoning, NRT variants sustain high-entropy, long, and semantically meaningful reasoning on both Llama-3.2-3B and Llama-3.1-8B models. This demonstrates that our approach prevents mode collapse while maintaining reasoning integrity.
  • Figure 3: Analysis of how NRT improves prediction of ground-truth tokens, particularly for those the SFT baseline finds most uncertain (high entropy, $H_{\text{SFT}}$). (a) The distribution of token entropy, showing that while most tokens are easy to predict, a long tail of high-entropy tokens exists. (b) Change in relative token probability ($P/P_{\text{SFT}}$). Weighted-sum NRT schemes (WS) provide the largest confidence gains for high-entropy tokens, effectively targeting the model's weaknesses.
  • Figure 4: Qualitative analysis of the vocabulary learned by NRT-WS $(-\log p)$. (b) The word cloud of the generated reasoning process is rich with procedural terms like 'let', 'step', and 'solve'. (a) NRT specifically learns to use meta-cognitive words like 'premise' and 'explanation' in its reasoning. (d) The ground-truth word cloud is more focused on the problem's nouns and final answer. (c) Conversely, NRT learns to suppress answer-specific formatting like 'boxed' from its reasoning traces. Together, these show NRT organically develops a distinct "language of reasoning".
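
The two quantities in Figure 3 can be computed as in the sketch below. It assumes that token entropy is the Shannon entropy of the SFT model's next-token distribution and that panel (b) averages the ratio $P/P_{\text{SFT}}$ within entropy bins; the captions do not spell out the binning, so the bin edges and the helper names (`token_entropy`, `mean_ratio_by_entropy_bin`) are hypothetical.

```python
import bisect
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def mean_ratio_by_entropy_bin(records, edges):
    """Average P / P_SFT within entropy bins.

    records: iterable of (h_sft, p_model, p_sft) for each ground-truth token,
             where h_sft is the SFT entropy at that position.
    edges:   sorted bin boundaries (an assumed choice); len(edges) + 1 bins.
    Returns one mean ratio per bin, or None for an empty bin.
    """
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for h_sft, p_model, p_sft in records:
        i = bisect.bisect_right(edges, h_sft)  # index of the bin containing h_sft
        sums[i] += p_model / p_sft
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

A mean ratio above 1 in the high-entropy bins would correspond to the pattern the caption describes: confidence gains concentrated on the tokens the SFT baseline was least sure about.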