On-Policy Supervised Fine-Tuning for Efficient Reasoning

Anhao Zhao; Ziyang Chen; Junlong Tong; Yingqi Fan; Fanghua Ye; Shuhao Li; Yunpu Ma; Wenjie Li; Xiaoyu Shen

TL;DR

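The paper shows that multi-reward RL for efficient reasoning, once KL regularization and group-wise normalization are removed and the reward is simplified to a truncation-based length penalty, reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. Despite its simplicity, this on-policy SFT consistently defines the accuracy-efficiency Pareto frontier.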

Abstract

Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought (CoT) reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two components and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80% while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at https://github.com/EIT-NLP/On-Policy-SFT.
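The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` type, `select_sft_examples`, `on_policy_sft_round`, and the `sample` and `train_step` callables are hypothetical names introduced here. It also assumes that "truncation" means a response longer than the token budget is cut off and therefore earns no reward, so only correct responses within the budget are kept as fine-tuning targets.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    """One self-generated response (hypothetical container for illustration)."""
    prompt: str
    response: str
    num_tokens: int   # response length in tokens
    is_correct: bool  # verdict of an answer verifier, assumed to be given


def select_sft_examples(rollouts: List[Rollout], token_budget: int) -> List[Rollout]:
    """Keep only rollouts that are correct AND fit within the token budget."""
    return [r for r in rollouts if r.is_correct and r.num_tokens <= token_budget]


def on_policy_sft_round(
    prompts: List[str],
    sample: Callable[[str], Rollout],             # draws from the CURRENT policy
    train_step: Callable[[List[Rollout]], None],  # one supervised update (assumed)
    rollouts_per_prompt: int,
    token_budget: int,
) -> List[Rollout]:
    """Sample from the current policy, filter, then fine-tune on the survivors."""
    batch: List[Rollout] = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(rollouts_per_prompt)]
        batch.extend(select_sft_examples(candidates, token_budget))
    if batch:  # skip the update when no rollout passes the filter
        train_step(batch)
    return batch


# Toy usage: three candidate answers to the same prompt, budget of 4 tokens.
rollouts = [
    Rollout("2+2?", "4", 1, True),                    # correct and short: kept
    Rollout("2+2?", "Let me think ... 4", 6, True),   # correct but over budget
    Rollout("2+2?", "5", 1, False),                   # short but incorrect
]
kept = select_sft_examples(rollouts, token_budget=4)
```

Because the rollouts are drawn from the model being trained rather than from a fixed dataset or a teacher, the supervised targets stay on-policy; in this sketch that property depends entirely on `sample` being called with the latest weights on every round.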

Paper Structure (62 sections, 36 equations, 18 figures, 1 table, 1 algorithm)

Introduction
Preliminary
GRPO for Efficient Reasoning
Reward Shaping for Efficient Reasoning
From GRPO to On-Policy SFT
Revisiting the GRPO Objective
KL Divergence
Group-wise Reward Normalization
The Simplest Length Penalty: Truncation
On-Policy SFT
Experimental Setup
Models and Training Dataset
Implementation Details
Evaluation
Metrics
...and 47 more sections

Figures (18)

Figure 1: On-Policy SFT achieves a state-of-the-art accuracy–length trade-off on DeepSeek-R1-1.5B, reducing CoT length by approximately 80% while slightly improving accuracy.
Figure 2: Performance–efficiency trade-offs of on-policy SFT and baseline methods under varying generation token budgets.
Figure 3: Average GPU memory consumption and wall-clock time per training step of on-policy SFT and the RL-based baseline ThinkPrune for the 1.5B model under varying rollout numbers.
Figure 4: Convergence speed of on-policy SFT and RL-based baseline ThinkPrune on MATH-500 (1.5B; rollout = 8, batch size = 64).
Figure 5: Length control comparison between on-policy SFT and RL-based baseline ThinkPrune measured by the coefficient of variation under varying token budgets on AIME24.
...and 13 more figures
