Sinkhorn Distance Minimization for Knowledge Distillation

Xiao Cui; Yulei Qin; Yuting Gao; Enwei Zhang; Zihan Xu; Tong Wu; Ke Li; Xing Sun; Wengang Zhou; Houqiang Li

TL;DR

SinKD reframes knowledge distillation as a batch-wise optimal transport problem using the Sinkhorn distance, addressing mode-imbalance issues inherent in KL, RKL, and JS divergences. By modeling distributions across batches, it captures geometric structure in high-dimensional logit space and avoids per-sample superficial matching. Empirical results on GLUE and SuperGLUE show SinKD outperforms state-of-the-art KD methods across encoder-only, encoder-decoder, and decoder-only LLMs, with robust ablations and hyper-parameter analyses supporting its effectiveness. The approach offers practical benefits for distilling large language models while remaining computationally tractable due to entropy-regularized OT, and it generalizes to generative LLMs with careful task/formulation considerations.
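For reference, the three divergences named above can be written out using their standard textbook definitions. The symbols here are assumptions for illustration and may differ from the paper's notation: $p$ is the teacher's output distribution, $q$ is the student's, and $m$ is their equal-weight mixture.

$$
\begin{aligned}
D_{\mathrm{KL}}(p \,\|\, q) &= \sum_{k} p_k \log \frac{p_k}{q_k}, \\
D_{\mathrm{RKL}}(p \,\|\, q) &= D_{\mathrm{KL}}(q \,\|\, p) = \sum_{k} q_k \log \frac{q_k}{p_k}, \\
D_{\mathrm{JS}}(p \,\|\, q) &= \tfrac{1}{2} D_{\mathrm{KL}}(p \,\|\, m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q).
\end{aligned}
$$

All three compare $p_k$ and $q_k$ class by class within a single sample. When the two distributions put their mass on different classes, these terms carry little information about how far apart the distributions are, which is the low-overlap regime the summary refers to.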

Abstract

Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the KL, RKL, and JS divergences respectively suffer from mode-averaging, mode-collapsing, and mode-underestimation, which degrade logits-based KD on diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by exploiting properties of the Sinkhorn metric, we dispense with sample-wise KD, which restricts the perception of divergence to each individual teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture the geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, demonstrates the superiority of SinKD over state-of-the-art methods on LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
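The batch-wise idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes an L1 cost between the temperature-softened teacher and student distributions, uniform mass on every sample in the batch, and a fixed number of Sinkhorn iterations. The names `sinkhorn_distance`, `tau`, `eps`, and `n_iters` are hypothetical and are not claimed to match the paper's hyper-parameters.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over one logit vector."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sinkhorn_distance(teacher_logits, student_logits, tau=2.0, eps=0.1, n_iters=50):
    """Return (distance, plan) for one batch of teacher and student logits."""
    p_t = [softmax(row, tau) for row in teacher_logits]
    p_s = [softmax(row, tau) for row in student_logits]
    b = len(p_t)

    # cost[i][j]: L1 distance between teacher sample i and student sample j
    # (an assumed cost; other ground metrics are possible).
    cost = [[sum(abs(x - y) for x, y in zip(p_t[i], p_s[j])) for j in range(b)]
            for i in range(b)]

    # Gibbs kernel of the entropy-regularized transport problem.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]

    marginal = 1.0 / b  # uniform mass per sample (assumption)
    u = [1.0] * b
    v = [1.0] * b
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [marginal / sum(kernel[i][j] * v[j] for j in range(b)) for i in range(b)]
        v = [marginal / sum(kernel[i][j] * u[i] for i in range(b)) for j in range(b)]

    # Transport plan and its total cost, which serves as the distance.
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(b)] for i in range(b)]
    distance = sum(plan[i][j] * cost[i][j] for i in range(b) for j in range(b))
    return distance, plan
```

Because the cost matrix pairs every teacher sample with every student sample, the resulting value depends on how the two sets of predictions are arranged across the whole batch rather than on each teacher-student pair in isolation. In a training loop the same computation would be written with a differentiable tensor library so that gradients reach the student.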

Paper Structure (38 sections, 20 equations, 5 figures, 9 tables)

Introduction
Related Work
Knowledge Distillation
Sinkhorn Distance
Methodology
Problem Statement
Classic Divergence Measures
KL Divergence
RKL Divergence
JS Divergence
Sinkhorn Distance
Batch-wise Reformulation
Cost Matrix Computation
Sinkhorn Normalization
Sinkhorn Loss
...and 23 more sections

Figures (5)

Figure 1: Limitations of existing divergence measures for the student to match the teacher in logits-based distillation. (a) Mode-averaging by Kullback-Leibler divergence. (b) Mode-collapsing by reverse Kullback-Leibler divergence. (c) Mode-underestimation by Jensen-Shannon divergence.
Figure 2: Illustration of our SinKD pipeline.
Figure 3: Performance at different student scales on (a) MRPC & (b) QQP. Best viewed magnified.
Figure 4: Effect of (a) $\lambda$ on MRPC & QQP and (b) $\tau_{\text{SD}}$ on MRPC & RTE. Best viewed magnified.
Figure 5: Effect of (a) $\alpha$ on MRPC & SST-2 and (b) $\beta$ on MRPC & RTE. Best viewed magnified.
