Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Wenjin Hou, Shangpin Peng, Weinong Wang, Zheng Ruan, Yue Zhang, Zhenglin Zhou, Mingqi Gao, Yifei Chen, Kaiqi Wang, Hongming Yang, Chengquan Zhang, Zhuotao Tian, Han Hu, Yi Yang, Fei Wu, Hehe Fan

Abstract

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on these insights, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism that restores order consistency between correct and incorrect trajectories. We conduct extensive experiments across 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation for LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
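
To make the order-consistency idea concrete: calibration should ensure that every correct trajectory's aggregated return ranks above every incorrect one's. The paper's exact formulation is not reproduced on this page, so the following is only a minimal Python sketch of one margin-based adjustment with that property; the function name `calibrate_returns`, the symmetric group shift, and the `margin` parameter are illustrative assumptions, not Uni-OPD's published rule.

```python
import numpy as np

def calibrate_returns(returns, outcome_rewards, margin=1.0):
    """Hypothetical sketch of outcome-guided margin calibration.

    returns         -- per-trajectory aggregated token-level guidance
                       (e.g. a summed teacher log-prob score).
    outcome_rewards -- binary outcome rewards (1 = correct, 0 = wrong).

    Using the outcome reward as a global anchor, the two groups are
    shifted apart just enough that every correct trajectory's return
    exceeds every incorrect one's by at least `margin`, restoring
    order consistency between the two signals.
    """
    returns = np.asarray(returns, dtype=float)
    correct = np.asarray(outcome_rewards, dtype=bool)
    if correct.all() or not correct.any():
        return returns  # only one group present; nothing to calibrate
    # Gap between the worst correct and the best incorrect trajectory.
    gap = returns[correct].min() - returns[~correct].max()
    if gap < margin:  # order consistency (with margin) is violated
        shift = (margin - gap) / 2.0
        returns = returns + np.where(correct, shift, -shift)
    return returns
```

The symmetric shift is one simple design choice: it widens only the gap between the two groups while leaving the ordering within each group untouched.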

Figures (16)

  • Figure 1: Overall performance comparisons and convergence behavior. Results are shown for settings including multi-teacher, strong-to-weak, and cross-modal distillation on math reasoning and code generation tasks. Uni-OPD consistently outperforms OPD and converges faster than RL, demonstrating its effectiveness across diverse settings.
  • Figure 2: Overview of the Uni-OPD framework. (Left) Offline difficulty-aware and online correctness-aware data balancing promote student exploration. (Right) Outcome-guided margin calibration mechanism improves the reliability of teacher supervision. (Middle) The resulting student policy merges complementary capabilities from multiple domain-specific teachers more effectively than standard OPD, leading to stronger overall performance.
  • Figure 3: Data difficulty distribution and its impact on OPD performance. (Left) Training data often exhibits mirrored J-shaped or U-shaped difficulty distributions. (Right) A naive strategy is to filter out overly easy or overly hard samples (i.e., all-correct or all-wrong cases), but this reduces diversity. In contrast, our difficulty-balancing strategy upsamples mid-difficulty samples to preserve a balanced spectrum and empirically outperforms filtering (a minimal sketch of this upsampling appears after this list).
  • Figure 4: Impact of the online ratio of correct to incorrect rollouts on the student's final performance.
  • Figure 5: Demonstration of unreliable teacher supervision and the outcome-guided margin calibration mechanism. (Left) Standard teacher supervision in OPD suffers from misalignment between trajectory-level returns and outcome rewards, yielding unreliable supervision signals. (Right) Our method uses outcome rewards as a global anchor to calibrate returns through a margin-based adjustment, restoring order consistency and improving optimization stability.
  • ...and 11 more figures
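
As referenced in the Figure 3 caption above, the offline difficulty-aware balancing upsamples mid-difficulty prompts rather than filtering the extremes. The paper's exact procedure is not shown on this page; below is a minimal, hypothetical Python sketch in which the function name `difficulty_balance`, the pass-rate binning, and the match-the-largest-bin target are all illustrative assumptions.

```python
import random

def difficulty_balance(samples, pass_rates, num_bins=10, seed=0):
    """Hypothetical sketch of offline difficulty-aware data balancing.

    pass_rates[i] is the fraction of rollouts for prompt i that were
    correct (0.0 = all wrong, 1.0 = all correct). Instead of filtering
    the extremes, under-populated bins are upsampled with replacement
    to the size of the largest bin, flattening a U-shaped (or mirrored
    J-shaped) difficulty distribution.
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(num_bins)]
    for sample, p in zip(samples, pass_rates):
        bins[min(int(p * num_bins), num_bins - 1)].append(sample)
    target = max(len(b) for b in bins)
    balanced = []
    for b in bins:
        if b:  # skip empty difficulty levels
            balanced.extend(b)
            balanced.extend(rng.choices(b, k=target - len(b)))
    rng.shuffle(balanced)
    return balanced
```

Under a U-shaped distribution, the scarce mid-difficulty bins are replicated until each occupied bin matches the largest, so a training epoch sees a roughly uniform difficulty spectrum without discarding any prompts.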