Decoupled Kullback-Leibler Divergence Loss

Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu, Hanwang Zhang

TL;DR

The proposed approach achieves new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive performance on knowledge distillation, demonstrating its substantial practical merits.

Abstract

In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of 1) a weighted Mean Square Error ($\mathbf{w}$MSE) loss and 2) a Cross-Entropy loss incorporating soft labels. Thanks to the decomposed formulation of the DKL loss, we identify two areas for improvement. First, we address the limitation of KL/DKL in scenarios like knowledge distillation by breaking its asymmetric optimization property. This modification ensures that the $\mathbf{w}$MSE component is always effective during training, providing extra constructive cues. Second, we introduce class-wise global information into KL/DKL to mitigate bias from individual samples. With these two enhancements, we derive the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its effectiveness through experiments on the CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The proposed approach achieves new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive performance on knowledge distillation, demonstrating its substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
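To make the role of the soft-label Cross-Entropy term concrete, the following minimal sketch checks the standard identity KL(s_m || s_n) = CE(s_m, s_n) - H(s_m) on two softmax outputs. When s_m is treated as a constant, the entropy H(s_m) does not depend on the other branch, so KL and the soft-label Cross-Entropy give that branch the same gradient. The logit values and the function names are illustrative assumptions, not taken from the authors' code.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two strictly positive probability vectors.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def soft_label_cross_entropy(p, q):
    # Cross-entropy of prediction q against the soft label p.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

# Hypothetical logits for the two branches (illustrative values only).
o_m = [2.0, 0.5, -1.0]
o_n = [1.0, 1.5, -0.5]
s_m, s_n = softmax(o_m), softmax(o_n)

kl = kl_divergence(s_m, s_n)
ce = soft_label_cross_entropy(s_m, s_n)
h = entropy(s_m)

# KL and CE differ only by H(s_m), which is constant with respect to o_n.
gap = abs(kl - (ce - h))
print(f"KL={kl:.6f}  CE={ce:.6f}  H(s_m)={h:.6f}  gap={gap:.2e}")
```

This covers only the branch that receives gradients through s_n; the $\mathbf{w}$MSE component accounts for the gradients flowing through the other branch.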

Paper Structure

This paper contains 21 sections, 1 theorem, 16 equations, 3 figures, 18 tables, and 2 algorithms.

Key Result

Theorem 1

From the perspective of gradient optimization, the Kullback-Leibler (KL) Divergence loss is equivalent to the following Decoupled Kullback-Leibler (DKL) Divergence loss when $\alpha=1$ and $\beta=1$. Here, $\mathcal{S}(\cdot)$ represents the stop-gradient operation, $\mathbf{s}_{m}^{\top}$ is the transpose of $\mathbf{s}_{m}$, $\mathbf{w}_{m}^{j,k} = \mathbf{s}_{m}^{j} * \mathbf{s}_{m}^{k}$, $\Delta
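The gradient equivalence on the $o_m$ branch can be checked numerically with a small sketch. It assumes that the pairwise terms are logit differences, $o^{j} - o^{k}$, and that the weighted MSE carries a 1/4 scaling with $\alpha = 1$; both are assumptions of this sketch and are not confirmed by the text above. The stop-gradient is imitated by computing the weights $\mathbf{w}_{m}^{j,k} = \mathbf{s}_{m}^{j} * \mathbf{s}_{m}^{k}$ once and holding them, and $o_n$, fixed while differentiating. All function names and logit values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_from_logits(o_m, o_n):
    # KL(softmax(o_m) || softmax(o_n)).
    s_m, s_n = softmax(o_m), softmax(o_n)
    return sum(a * math.log(a / b) for a, b in zip(s_m, s_n))

def weighted_mse(o_m, o_n, weights):
    # Assumed form: (1/4) * sum_{j,k} w[j][k] * ((o_m[j]-o_m[k]) - (o_n[j]-o_n[k]))^2.
    # `weights` and `o_n` are constants here, standing in for the stop-gradient.
    n = len(o_m)
    total = 0.0
    for j in range(n):
        for k in range(n):
            diff = (o_m[j] - o_m[k]) - (o_n[j] - o_n[k])
            total += weights[j][k] * diff * diff
    return total / 4.0

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Hypothetical logits (illustrative values only).
o_m = [2.0, 0.5, -1.0]
o_n = [1.0, 1.5, -0.5]

# Weights are computed once from s_m and then frozen.
s_m = softmax(o_m)
w_m = [[s_m[j] * s_m[k] for k in range(len(s_m))] for j in range(len(s_m))]

grad_kl = numeric_grad(lambda o: kl_from_logits(o, o_n), o_m)
grad_wmse = numeric_grad(lambda o: weighted_mse(o, o_n, w_m), o_m)

max_gap = max(abs(a - b) for a, b in zip(grad_kl, grad_wmse))
print("grad KL   wrt o_m:", [round(g, 6) for g in grad_kl])
print("grad wMSE wrt o_m:", [round(g, 6) for g in grad_wmse])
print(f"max abs difference: {max_gap:.2e}")
```

Under these assumptions the two gradients agree up to finite-difference error, which is consistent with the claim that KL and DKL coincide for backward optimization; the loss values themselves are not expected to match.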

Figures (3)

  • Figure 1: Comparison of gradient backpropagation among the KL, DKL, and IKL losses. The DKL loss is equivalent to the KL loss with respect to backward optimization. Depending on the application scenario, $\mathcal{M}$ and $\mathcal{N}$ can be the same model (as in adversarial training) or two separate models (as in knowledge distillation). Similarly, $x_{m}, x_{n} \in X$ can be the same image (as in knowledge distillation) or two different images (as in adversarial training). $o_{m}$ and $o_{n}$ are the output logits, from which the probability vectors are obtained by applying the softmax activation. Black arrows represent the forward process, while colored arrows indicate the backward process driven by the loss function of the same color. "$\mathbf{w}$MSE" is a weighted Mean Square Error (MSE) loss. "$\mathbf{\bar{w}}$MSE" additionally incorporates class-wise global information.
  • Figure 2: We achieve SOTA robustness on CIFAR-100. "Star" represents our method, while "circle" denotes previous methods. "Black" denotes adversarial training with image preprocessing limited to random crop and flip, "blue" denotes methods with AutoAug or CutMix, and "red" denotes methods using synthesized data. AA is short for Auto-Attack [croce2020reliable].
  • Figure 3: Visualization comparisons. (a) t-SNE visualization of the model trained by IKL-AT on CIFAR-100; (b) t-SNE visualization of the model trained by TRADES on CIFAR-100; (c) class margin differences between models trained by IKL-AT and TRADES.

Theorems & Definitions (1)

  • Theorem 1