Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

Xin Xu, Cliveb AI, Kai Yang, Tianhao Chen, Yang Wang, Saiyong Yang, Can Yang

TL;DR

This work addresses the high computational cost of RLVR caused by long-context requirements. It introduces Thinking-Free Policy Initialization (TFPI), a lightweight pre-RLVR stage that uses ThinkingFree inputs to accelerate convergence and improve token efficiency while preserving slow-thinking patterns. TFPI comprises Thinking-Free inference, Thinking-Free training, and a Thinking-Free initialization phase that can be executed with standard RLVR algorithms (e.g., DAPO), achieving strong performance even with short training contexts and without bespoke rewards. Across multiple model sizes and domains, TFPI, either standalone or as a foundation for subsequent RLVR, yields higher final accuracy with substantially reduced compute and token usage, suggesting an effective and scalable path for training high-performing, token-efficient reasoning models.

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce **T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation, explicitly discarding the thinking content via a direct *</think>* append, to reduce token usage during inference. Training with *ThinkFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using fewer than 4K H20 hours.
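
As a minimal sketch of the *ThinkFree* operation described above, the thinking block can be closed before generation starts, so the model proceeds directly to its answer. The sketch assumes a chat template that opens the response with a `<think>` tag; the template strings and the helper name `think_free` are hypothetical and are not taken from the paper.

```python
THINK_OPEN = "<think>"    # assumed tag that opens the thinking block
THINK_CLOSE = "</think>"  # tag appended to discard the thinking content


def think_free(templated_prompt: str) -> str:
    """Return a prompt whose thinking block is already closed and empty.

    `templated_prompt` is the query after the chat template has been applied.
    If the template did not already open a thinking block, one is opened
    here so that the closing tag is well formed.
    """
    prompt = templated_prompt.rstrip()
    if not prompt.endswith(THINK_OPEN):
        prompt += "\n" + THINK_OPEN
    # An empty thinking block: the model continues with the answer only.
    return prompt + "\n\n" + THINK_CLOSE + "\n\n"


if __name__ == "__main__":
    # Hypothetical template output; real templates depend on the model.
    templated = "User: What is 17 * 3?\nAssistant: <think>\n"
    print(think_free(templated))
```

Under this reading, the same wrapper would be applied to queries both at inference time and when sampling rollouts during TFPI training, while evaluation in the original slow-thinking mode simply skips it.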

Paper Structure

This paper contains 30 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Our proposed TFPI accelerates the convergence of RLVR to a higher performance ceiling (left) and yields more token-efficient reasoning models (right). Left: avg@32 versus training compute, measured in H20 hours. "Direct RL" refers to directly training Qwen3-4B with a 32K context window using DAPO, while "TFPI + RL" denotes running 32K-context DAPO after initialization with our 3-stage TFPI. The x-axis for TFPI uses a linear scale during the TFPI phase, followed by a logarithmic scale, with the transition indicated by a black vertical line. Right: Average accuracy on 4 reasoning datasets (AIME24/25, Beyond AIME, and GPQA) versus average output tokens. Points in the upper-left region indicate better performance. Baseline names and their corresponding numbers are listed in Table \ref{tab:results_tokens}. Red dots denote different stages of our TFPI.
  • Figure 2: Results of the meta-experiment on the ThinkingFree operation. Left: Average output tokens in thinking mode and ThinkingFree mode on AIME25. Right: Evolution of avg@32 and average output tokens on AIME24 with thinking-mode evaluation over training steps under 4K training response length.
  • Figure 3: Behaviour-Level Analysis of DS-1.5B over the TFPI Training Course. The ratio of verification steps and the average output tokens over training steps on the training set in thinking-free mode (Left) and on AIME25 in thinking mode (Right) in 3 stages of TFPI.
  • Figure 4: Parameter-Level Analysis. Left: PCA projection of model parameters from DS-1.5B to final checkpoints. TFPI (blue) starts at A, passes through intermediate points (B1, B2, B3), and ultimately converges near the Direct RL final checkpoint (C). Right: Cosine similarity between parameter updates of TFPI-trained checkpoints and (C-A) across layers during training.
  • Figure 5: Left: Average number of answer tokens (excluding the thinking part) and the ratio of answer length to total response length over training steps, for TFPI with DS-1.5B on AIME25 in thinking mode. Right: Average number of tokens during rollout on the training set over training steps, for long CoT RL with DS-1.5B, with and without the TFPI stage.
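
The avg@32 metric and the cosine-similarity analysis named in the captions above can be illustrated with a short sketch. It assumes that avg@k means the accuracy over k sampled responses per problem, averaged over problems, and that each layer's parameters are flattened into a vector before comparing updates. The function names are hypothetical, and the paper's exact procedure may differ.

```python
import math
from typing import Sequence


def avg_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Mean accuracy over k samples per problem, averaged over problems.

    correct[i][j] is True when sample j for problem i is judged correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def update_alignment(
    start: Sequence[float],
    checkpoint: Sequence[float],
    reference: Sequence[float],
) -> float:
    """Cosine similarity between (checkpoint - start) and (reference - start).

    With start = A (initial model), reference = C (Direct RL final
    checkpoint), and checkpoint = a TFPI-trained checkpoint, this mirrors
    the comparison against (C - A) for one layer's flattened weights.
    """
    update = [c - s for c, s in zip(checkpoint, start)]
    target = [r - s for r, s in zip(reference, start)]
    return cosine_similarity(update, target)


if __name__ == "__main__":
    # Two toy problems with two samples each.
    print(avg_at_k([[True, False], [True, True]]))
    # Toy 2-parameter "layer": the update points the same way as C - A.
    print(update_alignment([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))
```

A value near 1 from `update_alignment` would indicate that a checkpoint's update moves in the same direction as the Direct RL update for that layer, and a value near 0 would indicate an orthogonal update.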