Decoupled Kullback-Leibler Divergence Loss

Jiequan Cui; Zhuotao Tian; Zhisheng Zhong; Xiaojuan Qi; Bei Yu; Hanwang Zhang

TL;DR

The proposed approach achieves new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive performance on knowledge distillation, demonstrating its substantial practical merit.

Abstract

In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of 1) a weighted Mean Square Error ($\mathbf{w}$MSE) loss and 2) a Cross-Entropy loss incorporating soft labels. The decoupled formulation of the DKL loss exposes two areas for improvement. Firstly, we address the limitation of KL/DKL in scenarios like knowledge distillation by breaking its asymmetric optimization property. This modification ensures that the $\mathbf{w}$MSE component is always effective during training, providing extra constructive cues. Secondly, we introduce class-wise global information into KL/DKL to mitigate bias from individual samples. With these two enhancements, we derive the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its effectiveness on the CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The proposed approach achieves new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive performance on knowledge distillation, demonstrating its substantial practical merit. Our code is available at https://github.com/jiequancui/DKL.

Paper Structure (21 sections, 1 theorem, 16 equations, 3 figures, 18 tables, 2 algorithms)

Introduction
Related Work
Method
Preliminary
Decoupled Kullback-Leibler Divergence Loss
Improved Kullback-Leibler Divergence Loss
A Case Study and Analysis
Experiments
Adversarial Robustness
Knowledge Distillation
Knowledge Distillation on Imbalanced Data
Ablation Studies
Conclusion and Limitation
Appendix
Proof of Theorem 1
...and 6 more sections

Key Result

Theorem 1

From the perspective of gradient optimization, the Kullback-Leibler (KL) Divergence loss is equivalent to the following Decoupled Kullback-Leibler (DKL) Divergence loss when $\alpha=1$ and $\beta=1$. Here, $\mathcal{S}(\cdot)$ denotes the stop-gradient operation, $\mathbf{s}_{m}^{\top}$ is the transpose of $\mathbf{s}_{m}$, $\mathbf{w}_{m}^{j,k} = \mathbf{s}_{m}^{j} * \mathbf{s}_{m}^{k}$, $\Delta$ ...
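The gradient equivalence claimed in Theorem 1 can be checked numerically. The sketch below is not taken from the authors' code. It assumes a DKL form that this excerpt does not show in full: a weighted MSE term $\frac{\alpha}{4}\sum_{j,k}\mathcal{S}(\mathbf{w}_{m}^{j,k})\,(\Delta m_{j,k}-\mathcal{S}(\Delta n_{j,k}))^{2}$ with $\Delta m_{j,k}=o_{m}^{j}-o_{m}^{k}$ and $\Delta n_{j,k}=o_{n}^{j}-o_{n}^{k}$, plus a cross-entropy term $-\beta\,\mathcal{S}(\mathbf{s}_{m}^{\top})\log\mathbf{s}_{n}$. It also assumes the KL direction $\mathrm{KL}(\mathbf{s}_{m}\,\|\,\mathbf{s}_{n})$. The function names and the `*_sg` arguments are hypothetical; the `*_sg` arguments are frozen copies of the logits that emulate stop-gradient under finite differences.

```python
import math
import random


def softmax(o):
    """Numerically stable softmax over a list of logits."""
    m = max(o)
    e = [math.exp(v - m) for v in o]
    z = sum(e)
    return [v / z for v in e]


def kl_loss(o_m, o_n):
    """KL(s_m || s_n) computed from logits (direction is an assumption)."""
    s_m, s_n = softmax(o_m), softmax(o_n)
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(s_m, s_n))


def dkl_loss(o_m, o_n, o_m_sg, o_n_sg, alpha=1.0, beta=1.0):
    """Assumed DKL form: alpha/4 * wMSE + beta * cross-entropy.

    o_m_sg and o_n_sg are frozen copies of the logits; every quantity
    built from them plays the role of a stop-gradient term S(.).
    """
    s_m_sg = softmax(o_m_sg)
    c = len(o_m)
    wmse = 0.0
    for j in range(c):
        for k in range(c):
            w_jk = s_m_sg[j] * s_m_sg[k]      # S(w_m^{j,k})
            delta_m = o_m[j] - o_m[k]         # receives gradient
            delta_n = o_n_sg[j] - o_n_sg[k]   # S(delta n_{j,k})
            wmse += w_jk * (delta_m - delta_n) ** 2
    log_s_n = [math.log(q) for q in softmax(o_n)]
    ce = -sum(p * lq for p, lq in zip(s_m_sg, log_s_n))
    return alpha / 4.0 * wmse + beta * ce


def num_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at the list x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        hi[i] += eps
        lo = list(x)
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def max_abs_diff(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


random.seed(0)
o_m = [random.gauss(0, 1) for _ in range(5)]
o_n = [random.gauss(0, 1) for _ in range(5)]

# Gradients of KL with respect to each set of logits.
g_kl_m = num_grad(lambda v: kl_loss(v, o_n), o_m)
g_kl_n = num_grad(lambda v: kl_loss(o_m, v), o_n)

# Gradients of DKL; the frozen copies stay fixed at the current logits.
g_dkl_m = num_grad(lambda v: dkl_loss(v, o_n, o_m, o_n), o_m)
g_dkl_n = num_grad(lambda v: dkl_loss(o_m, v, o_m, o_n), o_n)

print("max |grad diff| w.r.t. o_m:", max_abs_diff(g_kl_m, g_dkl_m))
print("max |grad diff| w.r.t. o_n:", max_abs_diff(g_kl_n, g_dkl_n))
```

Under these assumptions both printed differences should be at the level of finite-difference noise. Only the gradients agree; the loss values of KL and DKL generally differ. The wMSE term alone drives $o_{m}$ and the cross-entropy term alone drives $o_{n}$.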

Figures (3)

Figure 1: Comparison of gradient backpropagation between the KL, DKL, and IKL losses. The DKL loss is equivalent to the KL loss with respect to backward optimization. Depending on the application scenario, $\mathcal{M}$ and $\mathcal{N}$ can be the same model (as in adversarial training) or two separate models (as in knowledge distillation). Similarly, $x_{m}, x_{n} \in X$ can be the same image (as in knowledge distillation) or two different images (as in adversarial training). $o_{m}$ and $o_{n}$ are the output logits, from which the probability vectors are obtained by applying the softmax activation. Black arrows represent the forward process, while colored arrows indicate the backward process driven by the loss function of the same color. "$\mathbf{w}$MSE" is a weighted Mean Square Error (MSE) loss. "$\mathbf{\bar{w}}$MSE" additionally incorporates class-wise global information.
Figure 2: We achieve SOTA robustness on CIFAR-100. "Star" represents our method, while "circle" denotes previous methods. "Black" means adversarial training whose image preprocessing includes only random crop and flip, "blue" is for methods with AutoAug or CutMix, and "red" represents methods using synthesized data. AA is short for Auto-Attack (croce2020reliable).
Figure 3: Visualization comparisons. (a) t-SNE visualization of the model trained by IKL-AT on CIFAR-100; (b) t-SNE visualization of the model trained by TRADES on CIFAR-100. (c) Class margin differences between models trained by IKL-AT and TRADES.

Theorems & Definitions (1)

Theorem 1
