Generalized Kullback-Leibler Divergence Loss

Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong

TL;DR

This work reinterprets KL Divergence loss as a Decoupled KL (DKL) loss that splits into a weighted MSE term and a cross-entropy with soft labels, enabling clearer gradient analysis. It identifies asymmetry and sample-bias issues in KL/DKL during knowledge distillation and proposes Generalized KL (GKL) loss, which combines a smoother, class-aware weight function with a decoupled objective to improve convergence and robustness. The approach achieves state-of-the-art adversarial robustness on RobustBench for CIFAR-10/100 and competitive performance on knowledge distillation across CIFAR-10/100, ImageNet, and CLIP-based models, while providing extensive ablations and practical guidelines. The results demonstrate that incorporating global class information and breaking optimization asymmetry substantially enhances both robustness and transfer learning in vision-language and multi-task settings, with broad potential applications and reasonable computational overhead.

Abstract

In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of the DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of the KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property and adopting a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Secondly, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating its substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.

Paper Structure

This paper contains 20 sections, 1 theorem, 15 equations, 4 figures, 22 tables, 2 algorithms.

Key Result

Theorem 1

From the perspective of gradient optimization, the Kullback-Leibler (KL) Divergence loss is equivalent to the following Decoupled Kullback-Leibler (DKL) Divergence loss when $\alpha=1$, $\beta=1$, and $\varphi(x_{m},x_{n})=\sqrt{\mathcal{S}(\mathbf{w}_{m})}$: where $\mathcal{S}(\cdot)$ represents the stop-gradient operation, $\mathbf{s}_{m}^{\top}$ is the transpose of $\mathbf{s}_{m}$, $\mathbf{w}_{m}^
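The equivalence claimed in Theorem 1 can be checked numerically. The sketch below is not the authors' implementation. It assumes, since the equation itself is not reproduced in the statement above, that the DKL loss has the form $\frac{\alpha}{4}\|\sqrt{\mathcal{S}(\mathbf{w}_{m})}(\Delta\mathbf{m}-\mathcal{S}(\Delta\mathbf{n}))\|^{2}-\beta\,\mathcal{S}(\mathbf{s}_{m}^{\top})\log\mathbf{s}_{n}$, with $\mathbf{w}_{m}^{j,k}=\mathbf{s}_{m}^{j}\mathbf{s}_{m}^{k}$ and $\Delta\mathbf{m}_{j,k}=o_{m}^{j}-o_{m}^{k}$. All function names are hypothetical, and the stop-gradient operation is emulated by passing frozen copies of the logits, so that finite differences only see the live arguments.

```python
import math

def softmax(o):
    """Numerically stable softmax over a list of logits."""
    top = max(o)
    e = [math.exp(v - top) for v in o]
    z = sum(e)
    return [v / z for v in e]

def kl(o_m, o_n):
    """KL(s_m || s_n) with s = softmax(o)."""
    s_m, s_n = softmax(o_m), softmax(o_n)
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(s_m, s_n))

def dkl(o_m, o_n, o_m_sg, o_n_sg, alpha=1.0, beta=1.0):
    """Assumed DKL form: wMSE on pairwise logit differences plus soft-label CE.

    o_m_sg and o_n_sg are frozen copies standing in for S(.) (stop gradient):
    they are never perturbed when a gradient is estimated.
    """
    s_m_sg = softmax(o_m_sg)          # S(s_m), used in the weights and soft labels
    s_n = softmax(o_n)                # live prediction that the CE term optimizes
    c = len(o_m)
    wmse = 0.0
    for j in range(c):
        for k in range(c):
            w = s_m_sg[j] * s_m_sg[k]          # S(w_m^{j,k})
            dm = o_m[j] - o_m[k]               # live difference, Delta m
            dn = o_n_sg[j] - o_n_sg[k]         # frozen difference, S(Delta n)
            wmse += w * (dm - dn) ** 2
    ce = -sum(p * math.log(q) for p, q in zip(s_m_sg, s_n))
    return alpha / 4.0 * wmse + beta * ce

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

# Arbitrary example logits (illustrative values only).
o_m = [2.0, -1.0, 0.5, 0.0]
o_n = [0.3, 1.2, -0.7, 0.1]

# Gradients with respect to o_m: only the wMSE term of DKL is live here.
g_kl_m = num_grad(lambda v: kl(v, o_n), o_m)
g_dkl_m = num_grad(lambda v: dkl(v, o_n, o_m, o_n), o_m)

# Gradients with respect to o_n: only the cross-entropy term of DKL is live here.
g_kl_n = num_grad(lambda v: kl(o_m, v), o_n)
g_dkl_n = num_grad(lambda v: dkl(o_m, v, o_m, o_n), o_n)

max_gap_m = max(abs(a - b) for a, b in zip(g_kl_m, g_dkl_m))
max_gap_n = max(abs(a - b) for a, b in zip(g_kl_n, g_dkl_n))
print("max gradient gap w.r.t. o_m:", max_gap_m)
print("max gradient gap w.r.t. o_n:", max_gap_n)
```

Under this assumed form, the wMSE term sends gradient only to $o_{m}$ and the cross-entropy term only to $o_{n}$. If $o_{m}$ comes from a fixed teacher, as in knowledge distillation, the wMSE term then contributes nothing to the student, which is one way to read the asymmetry that the abstract says GKL breaks.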

Figures (4)

  • Figure 1: We achieve SOTA robustness on CIFAR-100. Stars represent our method, while circles denote previous methods. Black marks adversarial training whose image preprocessing includes only random crop and flip, blue marks methods using AutoAug or CutMix, and red marks methods using synthesized data. AA is short for Auto-Attack (Croce and Hein, 2020).
  • Figure 2: Comparison of gradient backpropagation between the KL, DKL, and GKL losses. (b) The DKL loss is equivalent to (a) the KL loss with respect to backward optimization. Depending on the application scenario, $\mathcal{M}$ and $\mathcal{N}$ can be the same model (as in adversarial training) or two separate models (as in knowledge distillation). Similarly, $x_{m}, x_{n} \in X$ can be the same image (as in knowledge distillation) or two different images (as in adversarial training). $o_{m}$ and $o_{n}$ are the output logits, from which the probability vectors are obtained by applying the softmax activation. Solid arrows represent the forward process, while dotted arrows indicate the backward process driven by the loss function of the same color. $\varphi(x_{m},x_{n})$ is the weight function, which depends on the prediction for $x_{m}$. $\varphi^{*}(x_{m},x_{n})$ is our designed smoother weight function; it can be sample-wise or class-wise, depending on whether class-wise global information is incorporated.
  • Figure 3: Classification models suffer from an imbalanced distribution of predicted scores. (a) On ImageNet-LT; (b) on full ImageNet. The higher the predicted score, the more the entropy must decrease for knowledge distillation training to converge.
  • Figure 4: Visualization comparisons. (a) t-SNE visualization of the model trained by GKL-AT on CIFAR-100; (b) t-SNE visualization of the model trained by TRADES on CIFAR-100; (c) class margin differences between models trained by GKL-AT and TRADES.

Theorems & Definitions (1)

  • Theorem 1