StableKD: Breaking Inter-block Optimization Entanglement for Stable Knowledge Distillation

Shiu-hong Kao, Jierun Chen, S. H. Gary Chan

TL;DR

StableKD is a novel KD framework that breaks the Inter-Block Optimization Entanglement (IBOE) to achieve more stable optimization. It boosts model accuracy by 1% to 18%, speeds up convergence by up to 10 times, and outperforms other KD approaches with only 40% of the training data.

Abstract

Knowledge distillation (KD) has been recognized as an effective tool to compress and accelerate models. However, current KD approaches generally suffer from an accuracy drop and/or an excruciatingly long distillation process. In this paper, we tackle the issue by first providing a new insight into a phenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which makes conventional end-to-end KD approaches unstable with noisy gradients. We then propose StableKD, a novel KD framework that breaks the IBOE and achieves more stable optimization. StableKD distinguishes itself through two operations: Decomposition and Recomposition. The former divides a pair of teacher and student networks into several blocks for separate distillation, and the latter progressively merges them back, evolving towards end-to-end distillation. We conduct extensive experiments on the CIFAR100, Imagewoof, and ImageNet datasets with various teacher-student pairs. Compared to other KD approaches, our simple yet effective StableKD boosts model accuracy by 1% to 18%, speeds up convergence by up to 10 times, and outperforms them with only 40% of the training data.

Paper Structure

This paper contains 27 sections, 8 equations, 7 figures, 7 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Comparison of StableKD with vanilla logit-based and feature-based knowledge distillation frameworks.
  • Figure 2: Results of the preliminary experiments. All point to the soundness of our IBOE hypothesis.
  • Figure 3: Overview of our StableKD. It has two key operations, Decomposition and Recomposition. The former divides the teacher and student networks into several pairs of blocks, which then perform the KD process separately for certain epochs. After that, the latter merges the adjacent blocks two by two, followed by a new KD stage.
  • Figure 4: StableKD effectively alleviates the inter-block optimization entanglement.
  • Figure 5: StableKD outperforms others by achieving higher test accuracy with fewer training epochs on CIFAR100 and Imagewoof datasets.
  • ...and 2 more figures
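
The Decomposition and Recomposition schedule described in the Figure 3 caption can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `decompose`, `recompose`, and `schedule` are hypothetical, blocks are represented simply as inclusive index ranges, and leaving an unpaired trailing block as-is when the block count is odd is an assumption not stated in the source. The per-stage distillation itself (losses, epochs per stage) is omitted.

```python
from typing import List, Tuple

# A block is an inclusive range (first, last) over the initial block indices.
Block = Tuple[int, int]


def decompose(num_blocks: int) -> List[Block]:
    """Split the network into `num_blocks` single blocks (hypothetical helper)."""
    return [(i, i) for i in range(num_blocks)]


def recompose(blocks: List[Block]) -> List[Block]:
    """Merge adjacent blocks two by two.

    Assumption: with an odd number of blocks, the last block is carried
    over unmerged to the next stage.
    """
    merged = []
    for i in range(0, len(blocks), 2):
        pair = blocks[i:i + 2]
        merged.append((pair[0][0], pair[-1][1]))
    return merged


def schedule(num_blocks: int) -> List[List[Block]]:
    """List the block partition used at each KD stage, ending end-to-end."""
    stages = [decompose(num_blocks)]
    while len(stages[-1]) > 1:
        stages.append(recompose(stages[-1]))
    return stages


if __name__ == "__main__":
    for stage, blocks in enumerate(schedule(4)):
        print(f"stage {stage}: {blocks}")
```

Under these assumptions, each stage would distill every teacher-student block pair in its partition separately, and the final stage, with a single block spanning the whole network, corresponds to end-to-end distillation.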

Theorems & Definitions (1)

  • Definition 1: Recomposing Function $\mathcal{R}$