A Dual-Space Framework for General Knowledge Distillation of Large Language Models

Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou

TL;DR

White-box knowledge distillation (KD) for LLMs faces two core issues: the teacher and student prediction heads produce distributions in different output spaces, and the framework cannot be applied when the two models use different vocabularies. The paper proposes Dual-Space Knowledge Distillation (DSKD), which unifies the output spaces with two ideally initialized projectors, $W^{t \rightarrow s}$ and $W^{s \rightarrow t}$, that map hidden states between the teacher and student representation spaces so that both models can share one prediction head. An Exact Token Alignment (ETA) algorithm aligns identical tokens across differently tokenized sequences, which enables cross-tokenizer KD. The approach supports both off-policy and on-policy KD and outperforms existing methods on instruction-following, mathematical reasoning, and code generation benchmarks, including cross-vocabulary scenarios evaluated with GPT-4.

Abstract

Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
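
The idea of aligning "the same tokens in two differently-tokenized sequences" can be illustrated with a minimal sketch. This is not the paper's ETA algorithm, whose details are not given in this summary. It assumes that two tokens count as "the same" when they cover an identical character span of the same text, and that both token lists concatenate to exactly that text. The function names `char_spans` and `exact_token_alignment` are hypothetical.

```python
def char_spans(tokens):
    """Return the (start, end) character offsets of each token in the joined text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def exact_token_alignment(tokens_a, tokens_b):
    """Pair positions (i, j) whose tokens cover the identical character span.

    Assumption (not from the paper): both lists spell the same text when
    joined. Real tokenizers also need byte-level and special-token handling.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("token sequences must spell the same text")
    spans_a, spans_b = char_spans(tokens_a), char_spans(tokens_b)
    pairs, i, j = [], 0, 0
    while i < len(spans_a) and j < len(spans_b):
        if spans_a[i] == spans_b[j]:
            # Same start and end offsets, hence the same token text.
            pairs.append((i, j))
            i += 1
            j += 1
        elif spans_a[i][1] <= spans_b[j][1]:
            i += 1  # token in A ends first (or ties), so advance A
        else:
            j += 1  # token in B ends first, so advance B
    return pairs


teacher_tokens = ["Know", "ledge", " dist", "illation"]
student_tokens = ["Knowledge", " dist", "ill", "ation"]
print(exact_token_alignment(teacher_tokens, student_tokens))  # [(2, 1)]
```

Only the token " dist" is segmented identically by both hypothetical tokenizers, so only that position pair would be a candidate for token-level distillation under this simplified reading.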

Paper Structure

This paper contains 39 sections, 27 equations, 9 figures, 11 tables, 2 algorithms.

Figures (9)

  • Figure 1: Simulation results with KL and RKL divergence as the divergence $\mathcal{D}(\cdot||\cdot)$. (a), (b), (c), (e), (f), and (g) plot the student's hidden states and the teacher's hidden states before and after the two KD processes. "Different heads" means using the teacher and student head respectively during KD, while "Shared head" means only using the student head as the shared head to obtain the distributions during KD. (d) and (h) show the convergence curves of $\mathcal{L}_{kd}$ in the two KD processes.
  • Figure 2: The difference between KD for LLMs with the same vocabulary and different vocabularies.
  • Figure 3: The framework of our DSKD. Our DSKD includes the KD in student space and teacher space. In student space, we use the projector $\bm{W}^{t \rightarrow s}$ to project the teacher hidden states $\bm{H}^{t}_{1:n}$ to the student space as $\bm{H}^{t \rightarrow s}_{1:n}$. Then, we feed $\bm{H}^{t \rightarrow s}_{1:n}$ and $\bm{H}^{s}_{1:n}$ to the student prediction head to obtain the distributions in the same space, which are used to calculate $\mathcal{L}^{stu}_{kd}$ and $\mathcal{L}^{t \rightarrow s}_{ce}$. In teacher space, we use the projector $\bm{W}^{s \rightarrow t}$ to project the student hidden states $\bm{H}^{s}_{1:n}$ to the teacher space as $\bm{H}^{s \rightarrow t}_{1:n}$. After the teacher prediction head, we calculate $\mathcal{L}^{tea}_{kd}$ with the two distributions in the teacher space. The overall loss of DSKD sums the three losses: $\mathcal{L}_{dskd}=\mathcal{L}^{stu}_{kd} + \mathcal{L}^{t \rightarrow s}_{ce} + \mathcal{L}^{tea}_{kd}$.
  • Figure 4: The framework of our DSKD for LLMs with different vocabularies in student space.
  • Figure 5: Prompt for GPT-4 Evaluation.
  • ...and 4 more figures
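
The loss composition described in the Figure 3 caption, $\mathcal{L}_{dskd}=\mathcal{L}^{stu}_{kd} + \mathcal{L}^{t \rightarrow s}_{ce} + \mathcal{L}^{tea}_{kd}$, can be sketched for a single token position in plain Python. This is a toy illustration under several assumptions that are not stated in the caption: the projectors and heads are bare linear maps without bias, the divergence is forward KL with the teacher-side distribution as the first argument, and gradient flow (which terms update which parameters) is not modeled. All function and argument names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dskd_loss(h_s, h_t, W_t2s, W_s2t, head_s, head_t, target):
    """Sum of the three terms named in the Figure 3 caption, for one position."""
    # Student space: project the teacher state, then reuse the student head.
    p_s = softmax(matvec(head_s, h_s))
    p_t2s = softmax(matvec(head_s, matvec(W_t2s, h_t)))
    l_kd_stu = kl(p_t2s, p_s)            # assumed direction: KL(teacher || student)
    l_ce_t2s = -math.log(p_t2s[target])  # cross-entropy of the projected teacher
    # Teacher space: project the student state, then reuse the teacher head.
    p_t = softmax(matvec(head_t, h_t))
    p_s2t = softmax(matvec(head_t, matvec(W_s2t, h_s)))
    l_kd_tea = kl(p_t, p_s2t)
    return l_kd_stu + l_ce_t2s + l_kd_tea


# Toy check: identical states, identity projectors, and identical heads make
# both KD terms vanish, leaving only the cross-entropy term.
identity = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 0.0]
print(dskd_loss(h, h, identity, identity, identity, identity, target=0))
```

In the toy check both distributions in each space coincide, so the printed value equals the cross-entropy term alone. With states of different sizes, `W_t2s` would have shape (student dim × teacher dim) and `W_s2t` the transpose shape.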