On-Policy Self-Distillation for Reasoning Compression

Hejian Sang; Yuanda Xu; Zhengze Zhou; Ran He; Zhipeng Wang; Jiachen Sun

TL;DR

This work introduces OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves.

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.
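The core idea stated in the abstract (teacher logits from the same model under a conciseness instruction, per-token reverse KL on the student's own rollout) can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the logits are toy numbers rather than model outputs, and averaging over tokens (rather than summing) is a choice the abstract does not specify. In actual training the teacher logits would presumably be treated as fixed targets, which plain Python cannot express.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single next-token distribution."""
    log_p = log_softmax(student_logits)  # student: original prompt only
    log_q = log_softmax(teacher_logits)  # teacher: same model, concise instruction added
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def per_token_reverse_kl_loss(student_rollout_logits, teacher_rollout_logits):
    """Mean per-token reverse KL along one rollout sampled from the student.

    Both arguments are lists with one logit vector per generated token,
    evaluated on the same student-sampled token prefix.
    """
    kls = [
        reverse_kl(s, t)
        for s, t in zip(student_rollout_logits, teacher_rollout_logits)
    ]
    return sum(kls) / len(kls)


# Toy rollout: two token positions, vocabulary of size three.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[2.5, 0.0, -2.0], [0.1, 0.2, 0.3]]

loss = per_token_reverse_kl_loss(student, teacher)
print(f"toy per-token reverse KL: {loss:.4f}")
```

The loss is zero exactly when student and teacher agree at every position, so positions where the concise-conditioned model already matches the unconditioned one contribute no training signal.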

Paper Structure (69 sections, 7 theorems, 31 equations, 13 figures, 7 tables, 1 algorithm)

Introduction
Summary of results.
Related Work
Reasoning compression via reinforcement learning.
Reasoning compression via supervised fine-tuning.
Training-free compression.
On-policy self-distillation.
Method
Problem Formulation
Training Objective
Why reverse KL?
Teacher Parameterization
Difficulty-adaptive compression.
Training Algorithm
Computational cost and simplicity.
...and 54 more sections

Key Result

Theorem 1

The OPSDC objective (Eq. eq:loss) is equivalent to maximizing the expected implicit reward:
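A general identity for reverse KL suggests why an equivalence of this kind can hold. The following is a sketch under stated assumptions, not the paper's own statement or proof: the loss is taken to be the sequence-level reverse KL between the student $\pi_S(\cdot \mid x)$ and the concise-conditioned teacher $\pi_T(\cdot \mid x, c)$, the teacher is treated as fixed, and the symbols $\pi_S$, $\pi_T$, $r$, and $\mathcal{H}$ are illustrative notation that may differ from the paper's. The paper's exact definition of the implicit reward may also differ.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\pi_S(\cdot \mid x)\,\|\,\pi_T(\cdot \mid x, c)\big)
  &= \mathbb{E}_{y \sim \pi_S(\cdot \mid x)}
     \big[\log \pi_S(y \mid x) - \log \pi_T(y \mid x, c)\big] \\
  &= -\Big(\mathbb{E}_{y \sim \pi_S(\cdot \mid x)}\big[r(x, y)\big]
     + \mathcal{H}\big(\pi_S(\cdot \mid x)\big)\Big),
  \qquad r(x, y) := \log \pi_T(y \mid x, c).
\end{aligned}
```

Under these assumptions, minimizing the reverse KL is the same as maximizing the expected teacher log-likelihood of student samples plus the student's entropy. The entropy term is one plausible reading of why student entropy stays stable during training rather than collapsing.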

Figures (13)

Figure 1: The paradox of reasoning compression: less thinking, better answers. Results for Qwen3-14B across three benchmarks of increasing difficulty (30K response token budget). OPSDC compresses reasoning traces by 35-57% while largely preserving or improving accuracy, most dramatically on MATH-500, where accuracy jumps from 70.0% to 86.1%.
Figure 2: Prompt example for student and teacher policies. Both policies share the same model parameters but differ in conditioning context. The teacher receives only a conciseness instruction $c$ prepended to the problem; no ground-truth answers or reference solutions are provided. This is the key distinction from prior self-distillation work (shenfeld2025sdft), where the teacher receives the ground-truth solution as privileged information. The student prompt is the original prompt from the DAPO-17K dataset.
Figure 3: Self-distillation preserves model entropy throughout training. Average per-token entropy of the student model over training steps for Qwen3-8B (left) and Qwen3-14B (right) using the concise instruction. Unlike RL with length penalties, which drives entropy toward collapse (chen2025dler; cui2025entropy), OPSDC maintains stable entropy: the model learns to be concise without losing its exploratory capacity.
Figure 4: Student mean accuracy on training data increases during self-distillation. Qwen3-8B improves from ${\sim}$52% to ${\sim}$66% and Qwen3-14B from ${\sim}$46% to ${\sim}$72%, despite no correctness reward. The concise teacher's implicit reward reshapes the student's output distribution, concentrating probability mass on direct, correct reasoning paths.
Figure 5: Problem 1 illustrates how excessive deliberation leads to a genuine reasoning error: the base model talks itself into a wrong interpretation. Problem 2 shows a format failure: correct reasoning is buried in 3,500 tokens of redundant verification and post-</think> repetition, causing answer extraction to fail. In both cases, compression eliminates the noise that caused the error.
...and 8 more figures

Theorems & Definitions (27)

proof : Proof sketch
Theorem 1: Implicit reward
proof : Proof sketch
Remark 1
proof : Proof sketch
proof : Proof sketch
proof : Proof sketch
proof : Proof sketch
Lemma 1: Chain rule of KL for autoregressive models
proof
...and 17 more
