Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

Taiqiang Wu; Chaofan Tao; Jiahao Wang; Runming Yang; Zhe Zhao; Ngai Wong

TL;DR

This work challenges the conventional view that forward KL (FKL) is mean-seeking and reverse KL (RKL) is mode-seeking in knowledge distillation for large language models, showing that both losses converge to the teacher distribution given sufficient training. Because practical training budgets allow only a limited number of epochs, the authors also examine early training and find that FKL emphasizes the head of the distribution while RKL attends to the tail. To exploit these complementary tendencies, they introduce Adaptive Kullback-Leibler (AKL) divergence, which weights FKL and RKL according to the head and tail gaps between teacher and student, computed with a head-tail mask and gap metrics. Empirical results on multiple distillation setups and GPT-4-based evaluations show that AKL consistently outperforms the baselines with modest compute overhead, improving both the diversity and the quality of generated responses.

Abstract

Kullback-Leibler divergence has been widely used in Knowledge Distillation (KD) to compress Large Language Models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, this study empirically and theoretically demonstrates that neither mode-seeking nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are found to share the same optimization objective, and both converge after a sufficient number of epochs. In practice, however, LLMs are seldom trained for that many epochs. We further find that, in the early epochs, RKL focuses on the tail of the distributions while FKL focuses on the head. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence method, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations demonstrate that the proposed AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses. Code is available at https://github.com/wutaiqiang/LLM_KD_AKL.
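As a rough illustration of the idea described in the abstract, the sketch below computes FKL and RKL for one token distribution and mixes them with gap-based weights. Everything beyond "FKL, RKL, a head-tail mask, and gap metrics" is an assumption made for this example and not taken from the authors' code: the head is taken to be the top tokens needed to cover a teacher probability mass of `mu` (0.5 is a hypothetical value), the gap is the summed absolute difference between teacher and student, and FKL is weighted by the head's share of the total gap. The function names `head_mask` and `akl_sketch` are hypothetical.

```python
import numpy as np

def fkl(p, q, eps=1e-12):
    """Forward KL, KL(p || q), with teacher p and student q."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def rkl(p, q, eps=1e-12):
    """Reverse KL, KL(q || p), with teacher p and student q."""
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def head_mask(p, mu=0.5):
    """Boolean mask of 'head' tokens (assumed definition).

    Tokens are sorted by teacher probability; a token is in the head
    if the mass of all higher-ranked tokens is still below mu.
    """
    order = np.argsort(-p)                 # indices, most probable first
    cum = np.cumsum(p[order])              # cumulative teacher mass
    in_head_sorted = (cum - p[order]) < mu # mass before this token < mu
    mask = np.zeros(p.shape, dtype=bool)
    mask[order] = in_head_sorted
    return mask

def akl_sketch(p, q, mu=0.5):
    """Gap-weighted mix of FKL and RKL (assumed weighting rule)."""
    mask = head_mask(p, mu)
    gap = np.abs(p - q)
    g_head = gap[mask].sum()
    g_tail = gap[~mask].sum()
    total = g_head + g_tail
    # Fall back to equal weights when teacher and student already agree.
    w_fkl = 0.5 if total == 0 else float(g_head / total)
    loss = w_fkl * fkl(p, q) + (1.0 - w_fkl) * rkl(p, q)
    return loss, w_fkl

if __name__ == "__main__":
    teacher = np.array([0.6, 0.2, 0.1, 0.1])
    student = np.full(4, 0.25)
    loss, w_fkl = akl_sketch(teacher, student)
    print(f"FKL weight: {w_fkl:.3f}, mixed loss: {loss:.4f}")
```

In a real distillation loop this would be applied per token position on the teacher and student softmax outputs; the repository linked above is the reference for the actual method.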

Paper Structure (37 sections, 26 equations, 5 figures, 12 tables)

Introduction
Related Work
KD for LLM Compression
Forward KL and Reverse KL
Preliminary and Rethinking
Background
Rethinking FKL and RKL
Empirical Analysis.
Theoretical Analysis.
Deeper Insights.
Difference between FKL and RKL.
Summary
Methodology
Motivation
Adaptive Kullback-Leibler Divergence
...and 22 more sections

Figures (5)

Figure 1: The toy example from prior work (arXiv:2306.08543), which fits a Gaussian mixture (the teacher distribution) with a single Gaussian (the student distribution) using FKL and RKL.
Figure 2: The convergence of FKL and RKL on toy data at epoch 1 and epoch 200. The initial distribution $q$ is the same for FKL and RKL. After 200 epochs, both FKL and RKL converge closely to the target distribution regardless of the shape of $p$.
Figure 3: The distributions at various epochs for FKL and RKL on long-tailed toy data, using the same teacher distribution and the same initial student distribution for both losses. FKL focuses on the head and RKL on the tail in the early epochs, and both eventually converge.
Figure 4: The results of FKL+RKL, the proposed AKL, and AKL-r on GPT-2 120M. After flipping the loss weights, AKL-r performs worse than AKL on all three datasets.
Figure 5: The scores of diversity and quality from GPT-4 (total score: 10) on the responses from TinyLLaMA. AKL can improve the diversity and quality of generated responses compared to the baselines.
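The convergence claim in the Figure 2 caption can be checked on a small categorical example. The sketch below is a minimal illustration, not the authors' toy setup: it assumes a 7-way long-tailed teacher distribution, a student parameterized by softmax logits and initialized to uniform, and plain gradient descent with a hypothetical learning rate of 0.5. The gradients used are the standard ones with respect to the logits: `q - p` for FKL, and `q * (log(q/p) - RKL)` for RKL.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fit(p, loss, steps, lr=0.5):
    """Fit student logits to teacher p by gradient descent on FKL or RKL.

    loss is "fkl" for KL(p || q) or "rkl" for KL(q || p); the student
    starts from the uniform distribution in both cases.
    """
    z = np.zeros_like(p)
    for _ in range(steps):
        q = softmax(z)
        if loss == "fkl":
            grad = q - p                       # d KL(p||q) / d z
        else:
            log_ratio = np.log(q) - np.log(p)
            # d KL(q||p) / d z_j = q_j * (log(q_j/p_j) - KL(q||p))
            grad = q * (log_ratio - np.sum(q * log_ratio))
        z = z - lr * grad
    return softmax(z)

if __name__ == "__main__":
    # Hypothetical long-tailed teacher distribution (sums to 1).
    teacher = np.array([0.5, 0.2, 0.1, 0.08, 0.06, 0.04, 0.02])
    for steps in (1, 10, 5000):
        q_f = fit(teacher, "fkl", steps)
        q_r = fit(teacher, "rkl", steps)
        print(steps,
              "FKL max error:", np.max(np.abs(q_f - teacher)),
              "RKL max error:", np.max(np.abs(q_r - teacher)))
```

With few steps the two students differ from each other and from the teacher; with many steps both errors shrink toward zero, which is the behavior the caption describes. How the early-step difference splits between head and tail depends on the chosen teacher and learning rate, so this sketch should not be read as reproducing Figure 3.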

Theorems & Definitions (2)

proof
proof
