Online Knowledge Distillation with Reward Guidance

Chen Jia

TL;DR

PbKD reframes knowledge distillation as reward-guided imitation learning, solving a min-max problem between the student and a reward model constrained by human or AI preferences. It introduces offline and online PbKD with theoretical guarantees: a suboptimality bound in the offline setting of $O(\sqrt{\log(N/\delta)/N})$ and a regret bound in the online setting of $O(\sqrt{T \log T \log(T/\delta)})$, plus a moment-matching extension (MM PbKD) that leverages a $Q$-function formulation for white-box KD. Empirically, PbKD outperforms standard black-box and white-box KD baselines across five black-box and five white-box benchmarks, with iterated online preference updates yielding consistent gains. The approach is practical for both API-limited and fully observable teachers, offering a principled path to more task-aligned distillation with robust performance under reward uncertainty.
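To give a feel for the two rates quoted above, the following minimal sketch evaluates them numerically. It assumes all hidden constants equal 1 and picks $\delta = 0.05$ arbitrarily; the function names are hypothetical, and the numbers only illustrate how the bounds scale, not actual guarantees from the paper.

```python
import math

def offline_rate(n, delta=0.05):
    """Offline suboptimality rate sqrt(log(N/delta) / N), constants dropped (assumption)."""
    return math.sqrt(math.log(n / delta) / n)

def online_avg_regret_rate(t, delta=0.05):
    """Online regret rate sqrt(T log T log(T/delta)) divided by T, constants dropped (assumption)."""
    return math.sqrt(t * math.log(t) * math.log(t / delta)) / t

if __name__ == "__main__":
    # Both quantities shrink as the number of preference samples / iterations grows.
    for n in (100, 1_000, 10_000):
        print(n, round(offline_rate(n), 4), round(online_avg_regret_rate(n), 4))
```

Dividing the regret by $T$ shows that the per-iteration regret vanishes as $T$ grows, which is the sense in which a sublinear regret bound implies convergence.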

Abstract

This work studies knowledge distillation (KD) for large language models (LLMs) through preference optimization. We propose a reward-guided imitation learning framework for sequential KD, formulating a min-max optimization problem between the policy and reward model (RM) to minimize the performance gap between the student and teacher policies. Specifically, the reward optimization is constrained to achieve near-optimality within a confidence set for preference alignment. For preference data construction, we explore both offline and online preference-based KD. Additionally, we reformulate the RM using the $Q$-value function and extend the framework to white-box KD, where the teacher policy's predicted probabilities are accessible. Theoretical analysis and empirical results demonstrate the effectiveness of the proposed framework.
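The min-max structure described above can be sketched on a toy problem. The example below is not the paper's implementation: it assumes a one-step (bandit) setting, a small finite class of linear rewards, a Bradley-Terry preference model, and a confidence set defined by a log-likelihood slack `beta`. All names, features, and numbers are hypothetical.

```python
import math

# Hypothetical toy setup: three candidate responses with 2-d features.
FEATURES = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
# Hypothetical finite reward class: r_theta(y) = <theta, phi(y)>.
THETAS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
# Preference pairs (winner, loser), given as indices into FEATURES.
PREFS = [(0, 1), (0, 2), (2, 1)]
# Teacher and candidate student policies as distributions over responses.
TEACHER = (0.9, 0.0, 0.1)
STUDENTS = [(1 / 3, 1 / 3, 1 / 3), (0.0, 1.0, 0.0), (0.8, 0.0, 0.2)]

def reward(theta, y):
    return sum(t * f for t, f in zip(theta, FEATURES[y]))

def log_likelihood(theta, prefs):
    # Bradley-Terry model: P(w preferred over l) = sigmoid(r(w) - r(l)).
    return sum(-math.log1p(math.exp(-(reward(theta, w) - reward(theta, l))))
               for w, l in prefs)

def confidence_set(thetas, prefs, beta):
    # Keep every reward whose log-likelihood is within beta of the MLE.
    lls = [log_likelihood(t, prefs) for t in thetas]
    best = max(lls)
    return [t for t, ll in zip(thetas, lls) if ll >= best - beta]

def value(policy, theta):
    # Expected reward J(pi, r) in the one-step setting.
    return sum(p * reward(theta, y) for y, p in enumerate(policy))

def worst_case_gap(student, teacher, rewards):
    # Inner max: largest teacher-student gap over the confidence set.
    return max(value(teacher, th) - value(student, th) for th in rewards)

def distill(students, teacher, thetas, prefs, beta=1.0):
    # Outer min: pick the student with the smallest worst-case gap.
    rset = confidence_set(thetas, prefs, beta)
    return min(students, key=lambda s: worst_case_gap(s, teacher, rset))

if __name__ == "__main__":
    print(confidence_set(THETAS, PREFS, beta=1.0))
    print(distill(STUDENTS, TEACHER, THETAS, PREFS))
```

In this toy instance the preference data rule out rewards that favor the second response, and the selected student is the candidate whose distribution is closest to the teacher's. An online variant would, under the same assumptions, append new preference pairs sampled from the current student to `PREFS` and recompute the confidence set at each iteration.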

Paper Structure

This paper contains 14 sections, 11 theorems, 68 equations, 3 figures, 5 tables, 3 algorithms.

Key Result

Proposition 1

Under Assumption asp:linearbound, for any $\epsilon \leq 1$, the $\epsilon$-bracketing number of the MLE objective function class $\mathcal{L}_r$ with respect to the $L_\infty$ norm satisfies: $\log \mathcal{N}_{[]}(\epsilon, \mathcal{L}_r, L_{\infty}) \leq \mathcal{O} \left( d \log \frac{B}{\epsilo

Figures (3)

  • Figure 1: We formulate reward-guided imitation learning as the optimization of the performance gap between the teacher policy $\pi_E$ and the student policy $\hat{\pi}$, defined as $\hat{\pi} := \mathop{\arg\min}_\pi \max_{r} J(\pi_E, r) - J(\pi, r)$ (Eq. (\ref{eq:obj})). In the offline PbKD setting, a pre-collected preference dataset is used to construct a confidence set for constraining the reward model with MLE, i.e., $r \in \mathcal{R}(\mathcal{D}^{\rm pref}_{\rm off})$ (Algorithm \ref{alg:offlinekd}). In the online PbKD setting, new preference data are iteratively collected from the current student policy and incorporated into the confidence set, resulting in time-dependent constraints: $r \in \mathcal{R}(\mathcal{D}^{\rm pref}_t)$ at each iteration $t \in \{1, 2, \ldots, T\}$ (Algorithm \ref{alg:onlinekd}).
  • Figure 2: The prompt template for GPT-4 feedback.
  • Figure 3: Impact of reward model ($Q$-Function) size on performance over online iterations.

Theorems & Definitions (24)

  • Definition 1: $\epsilon$-Bracketing Number of the MLE Class $\mathcal{L}_r$
  • Proposition 1: Bracketing Number of the MLE Loss Class [zhanprovable]
  • Definition 2: Concentrability Coefficient for Offline Preference
  • Theorem 1: Suboptimality
  • Remark 1: Convergence Rate and Sample Complexity
  • Definition 3: Concentrability Coefficient for Online Preference
  • Theorem 2: Regret
  • Remark 2: Convergence Rate and Complexity
  • Proposition 2: Performance Difference Lemma [swamy2021moments; jiaadversarial]
  • Lemma 1: Bounding $L_1$-Norm of Reward Difference by Total Variation Distance
  • ...and 14 more