
Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

Hoang-Chau Luong, Dat Ba Tran, Lingwei Chen

Abstract

Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
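The overconfidence mechanism described in the abstract can be illustrated numerically. The sketch below uses the standard closed-form gradient of reverse KL with respect to softmax logits, $\partial_{z_j}\,\mathrm{RKL}(q\|p) = q_j(\log(q_j/p_j) - \mathrm{RKL})$; the distributions are illustrative numbers chosen here, not taken from the paper. When the student already matches the teacher on the target class but mismatches on the tail, the target-logit gradient is still negative, so gradient descent keeps pushing the target logit up:

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rkl_and_grad(z, p):
    """Reverse KL  RKL(q || p)  with q = softmax(z), and its gradient
    d RKL / d z_j = q_j * (log(q_j / p_j) - RKL)  (standard closed form)."""
    q = softmax(z)
    rkl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
    grad = [qi * (math.log(qi / pi) - rkl) for qi, pi in zip(q, p)]
    return rkl, grad

# Teacher p and student q agree on the target class m = 0 (both 0.5),
# but disagree on the non-target tail.
p = [0.5, 0.3, 0.2]
q = [0.5, 0.2, 0.3]
z = [math.log(v) for v in q]  # logits whose softmax is exactly q

rkl, grad = rkl_and_grad(z, p)
# grad[0] = q_0 * (0 - RKL) < 0 whenever the tail mismatch makes RKL > 0,
# so a gradient-descent step *raises* the target logit even though
# q_m already equals p_m -- the overconfidence effect the paper analyzes.
print(rkl, grad[0])
```

Note that the gradients over all logits sum to zero, so the upward push on the target logit is balanced by downward pressure spread across non-target classes, which is exactly what erodes output diversity.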

Paper Structure

This paper contains 20 sections, 2 theorems, 21 equations, 7 figures, and 4 tables.

Key Result

Proposition 4.1

Let $p, q \in \mathbb{R}^{V}$ denote the teacher and student output probabilities among $V$ classes, respectively, and let $m$ denote the index of the target class. Define $\tilde{p}_m = (p_m, 1 - p_m) \in \mathbb{R}^{2}$ and $\tilde{q}_m = (q_m, 1 - q_m) \in \mathbb{R}^{2}$ as the binary probabilities …
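The statement of Proposition 4.1 is truncated above. One decomposition consistent with the binary distributions $\tilde{p}_m, \tilde{q}_m$ defined there, and with the proposition's title ("Target and non-target decomposition of RKL"), splits the reverse KL into a binary target term plus a reweighted non-target term; this specific split is my reading, not quoted from the paper, though the identity itself is an exact algebraic fact. A sketch verifying it numerically:

```python
import math

def rkl(q, p):
    """Reverse KL divergence  sum_i q_i * log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

p = [0.4, 0.35, 0.15, 0.1]   # teacher probabilities (illustrative)
q = [0.3, 0.4, 0.2, 0.1]     # student probabilities (illustrative)
m = 0                        # target class index

# Binary distributions over {target, non-target}, as defined above.
p_tilde = [p[m], 1 - p[m]]
q_tilde = [q[m], 1 - q[m]]

# Non-target distributions renormalized to sum to 1.
p_hat = [pi / (1 - p[m]) for i, pi in enumerate(p) if i != m]
q_hat = [qi / (1 - q[m]) for i, qi in enumerate(q) if i != m]

# RKL(q || p) = RKL(q_tilde || p_tilde) + (1 - q_m) * RKL(q_hat || p_hat)
lhs = rkl(q, p)
rhs = rkl(q_tilde, p_tilde) + (1 - q[m]) * rkl(q_hat, p_hat)
print(lhs, rhs)  # the two sides agree exactly
```

The $(1 - q_m)$ weight on the non-target term makes concrete why RKL supervises the tail weakly: as the student grows confident in the target class, the non-target term's contribution shrinks toward zero.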

Figures (7)

  • Figure 1: FKL and RKL when fitting a teacher distribution under different output sizes. The x-axis shows the class index, and the y-axis shows the corresponding output probability. As the number of classes increases, the optimization difficulty grows substantially.
  • Figure 2: (a, b) Fidelity vs. diversity: RKL reduces diversity (Negative Self-BLEU and Distinct-2), while DRKL achieves a better balance across methods. (c) ROUGE-L vs. prediction confidence: RKL produces overconfident predictions without improving ROUGE-L, while DRKL is better calibrated.
  • Figure 3: Comparison of losses.
  • Figure 4: Performance of DRKL when combined with SRKL.
  • Figure 5: (a) Validation performance of different distillation losses. (b) DRKL performance across different values of $\gamma$. (c) Efficiency analysis.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Proposition 4.1: Target and non-target decomposition of RKL
  • Proposition 4.2: Target gradient under RKL
  • Proof of Proposition 4.1
  • Proof of Proposition 4.2