Entropy-Aware On-Policy Distillation of Language Models

Woogyeol Jin; Taywon Min; Yongjin Yang; Swanand Ravindra Kadhe; Yi Zhou; Dennis Wei; Nathalie Baracaldo; Kimin Lee

TL;DR

Entropy-Aware On-Policy Distillation balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency, and demonstrates that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.

Abstract

On-policy distillation is a promising approach for transferring knowledge between language models, in which a student learns from dense token-level signals along its own trajectories. This framework typically uses reverse KL divergence, which encourages the student to match the teacher's high-confidence predictions. However, we show that the mode-seeking property of reverse KL reduces generation diversity and yields unstable learning signals when the teacher distribution has high entropy. To address this, we introduce Entropy-Aware On-Policy Distillation. Our key idea is to augment the standard reverse KL objective with forward KL when teacher entropy is high, so that the student captures the full range of plausible outputs while retaining precise imitation elsewhere. The method balances mode-seeking precision with mode-covering robustness without sacrificing on-policy training efficiency. Experiments show that it maintains generation diversity (sustained token-level entropy) and improves student-teacher alignment (lower forward KL on high-entropy tokens). Across six math reasoning benchmarks, this yields Pass@8 accuracy gains of +1.37 for Qwen3-0.6B-Base, +2.39 for Qwen3-1.7B-Base, and +5.05 for Qwen3-4B-Base compared to baseline on-policy distillation methods. These results demonstrate that accounting for teacher uncertainty is essential for maintaining diversity and achieving effective knowledge transfer.
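The entropy-gated objective described in the abstract can be illustrated with a minimal per-token sketch. This is not the authors' implementation: the function names, the hard threshold gate, the `fkl_weight` parameter, and the default threshold of 0.8 (borrowed from the high-entropy cutoff mentioned in the Figure 5 caption) are all assumptions made for illustration, and the sketch omits the clipping and on-policy sampling that the full method relies on.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def entropy_aware_token_loss(student, teacher, entropy_threshold=0.8, fkl_weight=1.0):
    """Hypothetical per-token loss: reverse KL everywhere, plus forward KL
    only where the teacher distribution is high-entropy (assumed hard gate)."""
    # Reverse KL, KL(student || teacher): mode-seeking imitation.
    loss = kl(student, teacher)
    # Forward KL, KL(teacher || student): mode-covering term, added only
    # when the teacher is uncertain at this token position.
    if entropy(teacher) >= entropy_threshold:
        loss += fkl_weight * kl(teacher, student)
    return loss

# Toy next-token distributions over a 4-token vocabulary (illustrative values).
confident_teacher = [0.97, 0.01, 0.01, 0.01]   # low entropy: reverse KL only
uncertain_teacher = [0.25, 0.25, 0.25, 0.25]   # high entropy: forward KL is added
student = [0.85, 0.05, 0.05, 0.05]

loss_confident = entropy_aware_token_loss(student, confident_teacher)
loss_uncertain = entropy_aware_token_loss(student, uncertain_teacher)
```

With the confident teacher the loss reduces to plain reverse KL, while with the uncertain teacher the added forward KL term penalizes the student for concentrating mass on a single token.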

Paper Structure (27 sections, 16 equations, 11 figures, 9 tables, 1 algorithm)

Introduction
Contribution.
Preliminaries
KL-Based Divergences
On-Policy Distillation
OPD with Clipped-Reverse KL
Diversity Degradation and Instability in On-Policy Distillation
Token-Level Entropy Analysis
Instability of Reverse KL-based Reward
Entropy-Aware On-Policy Distillation
Experiments
Experimental Settings
Main Results
Out-of-domain Evaluation
Token-Level Entropy Analysis
...and 12 more sections

Figures (11)

Figure 1: Top-10 change rate across 3 seeds for Scenario A (blue), where the teacher distribution has low entropy, and Scenario B (red), where the teacher distribution has high entropy. In Scenario B, a student optimized with reverse KL fails to capture the teacher’s distribution, as evidenced by persistently high and fluctuating Top-10 change rates.
Figure 2: Pass@$k$ performance comparison between OPD and EOPD on the AIME and AMC benchmarks using the Qwen3-1.7B model. EOPD achieves higher Pass@$k$ compared to OPD, with the performance gap becoming more pronounced as $k$ increases.
Figure 3: Token-level entropy histograms comparing the Qwen3-8B teacher with Qwen3-1.7B-Base trained using OPD and EOPD on the AIME 2024 and 2025 benchmarks. While both methods exhibit similar distributions to the teacher in the mid-entropy range, EOPD preserves more probability mass in the high-entropy region, staying closer to the teacher than OPD.
Figure 4: Average policy entropy during training for the Qwen3-1.7B-Base student trained with OPD + Entropy Bonus, OPD + Advantage Shaping, and EOPD. Advantage Shaping converges to a lower entropy regime, while Entropy Bonus maintains entropy levels comparable to EOPD.
Figure 5: Average forward KL divergence measured during training at token positions where the teacher distribution exhibits high entropy (entropy $\ge 0.8$) for the Qwen3-1.7B-Base student. Compared to OPD + Entropy Bonus and OPD + Advantage Shaping, EOPD maintains lower forward KL values throughout training, indicating closer alignment with the teacher distribution in regions of high uncertainty.
...and 6 more figures
