Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Yuanfu Wang; Zhixuan Liu; Xiangtian Li; Chaochao Lu; Chao Yang

TL;DR

This work introduces Native Reasoning Training (NRT), a verifier-free framework that treats reasoning traces as latent variables and trains them using intrinsic rewards derived from question–answer pairs. By formalizing reasoning as a latent optimization problem and employing diverse aggregation-based reward schemes, including robust strategies like geometric mean and weighted sums, NRT mitigates policy collapse and fosters long, high-quality reasoning traces. Empirical results across multiple model families and nine reasoning benchmarks demonstrate state-of-the-art performance among verifier-free methods, with particularly large gains in complex multi-step tasks and strong robustness to training instabilities. The approach broadens the applicability of powerful reasoning in unverifiable domains and suggests a scalable path toward more capable reasoning systems that do not rely on external verifiers or expert demonstrations.
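As a rough sketch of what treating the reasoning trace as a latent variable can look like, the block below uses notation assumed here for illustration, not the paper's own equations. Take a question $x$, a reference answer $y^\star$ with tokens $y^\star_1, \dots, y^\star_T$, and a reasoning trace $z$ sampled from the policy $\pi_\theta$. The marginal likelihood of the answer sums over traces, Jensen's inequality gives a lower bound, and a generic intrinsic-reward objective replaces the log-likelihood with a reward $R$ built from per-token answer probabilities. The aggregation function $g$ is a placeholder for choices such as a geometric mean or a weighted sum.

```latex
\begin{align}
\log p_\theta(y^\star \mid x)
  &= \log \sum_{z} \pi_\theta(z \mid x)\, p_\theta(y^\star \mid x, z) \\
  &\ge \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
       \big[ \log p_\theta(y^\star \mid x, z) \big], \\
J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
       \big[ R(x, z, y^\star) \big], \\
R(x, z, y^\star)
  &= g\big( p_\theta(y^\star_1 \mid x, z), \dots,
            p_\theta(y^\star_T \mid x, z, y^\star_{<T}) \big).
\end{align}
```

Under this reading, no external verifier appears anywhere: the reward depends only on the model's own probabilities for the reference answer.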

Abstract

The prevailing paradigm for training large reasoning models, which combines Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR), is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and to systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop in which the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on the Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
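To make the idea of a reward aggregation function concrete, here is a minimal sketch, assuming the per-token probabilities of the reference answer (conditioned on the question and a sampled reasoning trace) are already available as a list. The function names `aggregate_reward`, `weighted_sum_reward`, and `surprisal_weights` are hypothetical, and the exact reward definitions used by NRT may differ; in particular, weighting by $-\log p$ under a baseline model is an assumption made for illustration.

```python
import math

EPS = 1e-12  # floor to keep log() finite; an implementation choice, not from the paper


def aggregate_reward(token_probs, scheme="geometric_mean"):
    """Collapse per-token probabilities of the reference answer into one scalar."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    probs = [min(max(p, EPS), 1.0) for p in token_probs]
    if scheme == "arithmetic_mean":
        return sum(probs) / len(probs)
    if scheme == "geometric_mean":
        # exp of the mean log-probability; one very unlikely token drags it down
        return math.exp(sum(math.log(p) for p in probs) / len(probs))
    raise ValueError(f"unknown scheme: {scheme}")


def surprisal_weights(baseline_probs):
    """Hypothetical weighting: -log p of each token under a baseline model."""
    return [-math.log(max(p, EPS)) for p in baseline_probs]


def weighted_sum_reward(token_probs, weights):
    """Normalized weighted sum of per-token probabilities."""
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, token_probs)) / total


if __name__ == "__main__":
    probs = [0.9, 0.1]  # one easy token, one hard token
    print(aggregate_reward(probs, "arithmetic_mean"))
    print(aggregate_reward(probs, "geometric_mean"))
    print(weighted_sum_reward(probs, surprisal_weights(probs)))
```

In this toy input, the geometric mean and the surprisal-weighted sum both score lower than the arithmetic mean because they put more emphasis on the hard token, which is the kind of behavior one would want from an aggregation that targets the model's uncertain predictions.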

Paper Structure (38 sections, 43 equations, 4 figures, 10 tables, 1 algorithm)

This paper contains 38 sections, 43 equations, 4 figures, 10 tables, 1 algorithm.

Introduction
Related Work
Method
Preliminaries: Standard Paradigms for Reasoning Training
Supervised Fine-Tuning (SFT).
Reinforcement Learning with Verifiable Rewards (RLVR).
Native Reasoning Training
Intrinsic Reward Shaping via Aggregation Functions
Reward Stabilization
Structural Format Supervision
Experiments
Experimental Setup
Main Results
Analysis of Training Dynamics of the Reasoning Process
Analysis of Groundtruth Token Probabilities
...and 23 more sections

Figures (4)

Figure 1: Comparison of Reinforcement Learning with Verifiable Rewards (RLVR) and our Native Reasoning Training (NRT). (Top) RLVR uses an external verifier to reward reasoning z that yields an answer y matching the ground-truth $y^\star$. This approach is constrained by its need for a verifier. (Bottom) NRT operates on general SFT data, using only a question x and a reference answer $y^\star$. It trains the model to generate a latent reasoning trace z by intrinsically rewarding traces that increase its own predictive confidence in the reference answer. This self-reinforcing process removes the need for external verifiers or expert-written reasoning.
Figure 2: Evolution of the reasoning process during RL fine-tuning across three dimensions: diversity (entropy; a, b), length (c, d), and semantic quality (e, f). Quality is measured by an LLM-as-a-judge using a 0-1 score (see the appendix on LLM-as-a-judge details). While the RLPR baseline suffers a rapid collapse across all metrics, producing short, repetitive, and low-quality reasoning, NRT variants sustain high-entropy, long, and semantically meaningful reasoning on both Llama-3.2-3B and Llama-3.1-8B models. This shows that our approach prevents mode collapse while maintaining reasoning integrity.
Figure 3: Analysis of how NRT improves prediction of ground-truth tokens, particularly for those the SFT baseline finds most uncertain (high entropy, $H_{\text{SFT}}$). (a) The distribution of token entropy, showing that while most tokens are easy to predict, a long tail of high-entropy tokens exists. (b) Change in relative token probability ($P/P_{\text{SFT}}$). Weighted-sum NRT schemes (WS) provide the largest confidence gains for high-entropy tokens, effectively targeting the model's weaknesses.
Figure 4: Qualitative analysis of the vocabulary learned by NRT-WS $(-\log p)$. (b) The word cloud of the generated reasoning process is rich with procedural terms like 'let', 'step', and 'solve'. (a) NRT specifically learns to use meta-cognitive words like 'premise' and 'explanation' in its reasoning. (d) The ground-truth word cloud is more focused on the problem's nouns and final answer. (c) Conversely, NRT learns to suppress answer-specific formatting like 'boxed' from its reasoning traces. Together, these show NRT organically develops a distinct "language of reasoning".
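The two quantities plotted in Figure 3, per-token entropy under the SFT baseline and the relative probability $P/P_{\text{SFT}}$ of each ground-truth token, can be computed along the following lines. This is a minimal sketch with hypothetical helper names and made-up toy inputs; how the paper bins tokens or extracts the distributions is not specified here and is assumed.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def relative_probability(p_model, p_sft, eps=1e-12):
    """Ratio P / P_SFT for one ground-truth token; eps guards against division by zero."""
    return p_model / max(p_sft, eps)


def mean_ratio_by_entropy_bin(entropies, ratios, edges):
    """Average ratio inside each half-open entropy bin [edges[i], edges[i+1])."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for h, r in zip(entropies, ratios):
        for i in range(n_bins):
            if edges[i] <= h < edges[i + 1]:
                sums[i] += r
                counts[i] += 1
                break
    # None marks an empty bin
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # Toy values only; not results from the paper.
    entropies = [0.1, 0.2, 1.5]
    ratios = [1.0, 1.2, 2.0]
    print(mean_ratio_by_entropy_bin(entropies, ratios, [0.0, 1.0, 2.0]))
```

A ratio above 1 in the high-entropy bins would correspond to the pattern the caption describes, where the trained model gains the most confidence on tokens the SFT baseline found hardest.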
