Direct Preference Knowledge Distillation for Large Language Models

Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei

TL;DR

DPKD tackles the inefficacy of standard KL-based KD for large language models by introducing an implicit reward function $r_p(y|x)$ and a direct preference mechanism, forming a two-stage optimization: first maximize $r_p(y|x) - \beta \mathrm{KLD}( q_\theta(y|x) \| p(y|x) )$, then maximize the likelihood that teacher outputs are preferred over student outputs. The method derives a tractable loss with a Bradley-Terry-based preference and a Plackett-Luce-style probability $p^*(y_t \succ y_s|x)$, linking the gradient to a Q-function perspective in RL. The authors provide theoretical derivations and extensive experiments across GPT-2 and OPT family models (120M–13B) on DollyEval, Self-Instruct, and S-NI datasets, showing improved Rouge-L and exact-match scores over strong baselines and analyzing the impact of reward design and generation length. This work offers a principled, scalable approach to instruction tuning for LLMs, with practical implications for efficient distillation and improved alignment of smaller models with teacher capabilities.

Abstract

In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges in the distillation of LLMs, including low efficiency and the insufficient measurement capability of the traditional KL divergence. It has been shown that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and implicit reward function. We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence, and then improving the preference probability of teacher outputs over student outputs. We conducted experiments and analysis on various datasets with LLM parameters ranging from 120M to 13B and demonstrate the broad applicability and effectiveness of our DPKD approach. Meanwhile, we prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact match percentage. Code and data are available at https://aka.ms/dpkd.
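
The second stage described above, raising the probability that teacher outputs are preferred over student outputs, can be illustrated with a small sketch. It assumes a DPO-style form in which the implicit reward is `beta * (log q_theta(y|x) - log p(y|x))` and the preference probability is a Bradley-Terry sigmoid of the reward difference. This form, the function names, and the default `beta` are assumptions made for illustration; the paper's exact loss is not reproduced in this summary.

```python
import math


def implicit_reward(student_logp: float, teacher_logp: float, beta: float) -> float:
    """Assumed implicit reward: beta * log(q_theta(y|x) / p(y|x)).

    student_logp and teacher_logp are sequence-level log-probabilities
    (sums of per-token log-probabilities) of the same response y.
    """
    return beta * (student_logp - teacher_logp)


def preference_loss(q_yt: float, p_yt: float, q_ys: float, p_ys: float,
                    beta: float = 0.1) -> float:
    """Negative log Bradley-Terry probability that y_t is preferred over y_s.

    q_* are student log-probs, p_* are teacher log-probs;
    y_t is a teacher output and y_s is a student output.
    The default beta is a hypothetical value, not taken from the paper.
    """
    margin = implicit_reward(q_yt, p_yt, beta) - implicit_reward(q_ys, p_ys, beta)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for one prompt.
    loss = preference_loss(q_yt=-12.0, p_yt=-10.0, q_ys=-9.0, p_ys=-15.0)
    print(f"preference loss: {loss:.4f}")
```

Under this assumed form, the loss falls as the student assigns more probability to the teacher output relative to its own sample, which is the behavior the preference stage is meant to encourage.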

Paper Structure

This paper contains 29 sections, 30 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: GPT-4 evaluation of different methods. Our DPKD method outperforms other baselines and is closest to the reference responses.
  • Figure 2: Illustration of the relation of rKLD, implicit reward and Rouge-L. We construct differentiated models by adding random noise to the base model. Lighter colors indicate higher Rouge-L scores.
  • Figure 3: We report the reverse KL divergence and the estimated implicit reward during training. The color of each point represents the training epoch. The end of training falls in the upper left corner, where the KL divergence is low and the reward is high.
  • Figure 4: Reward, KLD, and reverse KLD curves during the distillation process of GPT-2 Base. KLD and reverse KLD show similar trends.
  • Figure 5: Rouge-L score for different ranges of generation length. Across the different ranges of reference label length, DPKD scores higher than the baseline. In particular, DPKD stands out when the golden response length is in the middle range. The raw Rouge-L scores of each method are provided in the Appendix.