On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR

This work addresses the challenge of integrating off-policy expert data with on-policy exploration when post-training instruction-following LLMs. It introduces CHORD, a dynamic weighting framework that uses a global coefficient $\mu$ and a token-level weighting function $\phi(\cdot)$ to harmonize SFT and RL within a single objective. Through theoretical framing and extensive experiments on math reasoning and tool-use tasks, CHORD demonstrates improved stability and performance over the traditional SFT-then-RL pipeline and other baselines. The dual-control design offers practical guidance for absorbing expert knowledge robustly while preserving the model's own reasoning capabilities. The implementation is released as open source to foster further research.

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data's influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on mathematical reasoning problems and practical tool-use tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
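The dual-control idea described in the abstract can be illustrated with a minimal Python sketch. It is not the released implementation. The abstract only states that a global coefficient and a token-wise weighting function exist, so everything concrete here is an assumption made for illustration: the convex combination of the two losses, the linear decay schedule and its default values, the choice $\phi = p(1-p)$, and all function names.

```python
import math


def mu_schedule(step, warmup_steps, decay_steps, mu_high=0.9, mu_low=0.05):
    """Hypothetical schedule for the global coefficient mu.

    Holds mu at mu_high during warmup, then decays it linearly to mu_low,
    shifting emphasis from off-policy imitation to on-policy exploration.
    decay_steps must be positive.
    """
    if step < warmup_steps:
        return mu_high
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return mu_high + (mu_low - mu_high) * progress


def token_weight(p):
    """Assumed token-wise weight phi = p * (1 - p).

    p is the current policy's probability of the expert token. The weight
    is small when the policy is already confident (p near 1) and when the
    expert token is very unlikely under the policy (p near 0).
    """
    return p * (1.0 - p)


def weighted_sft_loss(expert_token_probs):
    """Token-weighted negative log-likelihood, averaged over expert tokens.

    In an autograd framework the weight would typically be treated as a
    constant (no gradient through phi); plain floats make that implicit.
    """
    total = 0.0
    for p in expert_token_probs:
        total += -token_weight(p) * math.log(p)
    return total / len(expert_token_probs)


def mixed_loss(rl_loss, expert_token_probs, mu):
    """Assumed combined objective: (1 - mu) * RL loss + mu * weighted SFT loss."""
    return (1.0 - mu) * rl_loss + mu * weighted_sft_loss(expert_token_probs)


if __name__ == "__main__":
    # Illustrative numbers only: a fixed RL loss and three expert-token probabilities.
    for step in (0, 60, 200):
        mu = mu_schedule(step, warmup_steps=10, decay_steps=100)
        print(step, round(mu, 3), round(mixed_loss(2.0, [0.9, 0.5, 0.1], mu), 4))
```

With `mu = 0` the sketch reduces to the pure on-policy RL loss, and with `mu = 1` it reduces to the token-weighted SFT loss, which is the sense in which SFT acts as a weighted auxiliary objective rather than a separate stage.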

Paper Structure

This paper contains 34 sections, 6 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: We train Qwen2.5-1.5B-Instruct on the Open-R1 dataset and evaluate the performance on a held-out validation set. These results show that the SFT-then-RL training paradigm can yield suboptimal performance compared to pure RL.
  • Figure 2: Average response length on math problems and tool-use tasks.
  • Figure 3: An overview of the proposed CHORD framework that unifies SFT and RL, featuring a global coefficient $\mu$ and a token-wise weighting function $\phi(\cdot)$.
  • Figure 4: Decaying the value of $\mu$ enables a smooth transition from off-policy imitation to on-policy optimization.
  • Figure 5: Comparisons of entropy loss between pure RL and mixed RL that integrates expert data (with or without the IS strategy).
  • ...and 7 more figures