Asymmetric Decision-Making in Online Knowledge Distillation: Unifying Consensus and Divergence

Zhaowei Chen, Borui Zhao, Yuchen Ge, Yuhao Chen, Renjie Song, Jiajun Liang

TL;DR

This work tackles the inefficiencies of offline teacher–student distillation by proposing Asymmetric Decision-Making (ADM) for Online Knowledge Distillation (OKD). ADM jointly promotes consensus in foreground regions for the student and encourages divergence in under-explored regions for the teacher, using a pair of region-aware losses $\mathcal{L}_{co}$ and $\mathcal{L}_{di}$ based on feature similarity $\mathcal{S}$. The approach yields consistent improvements across OKD and offline KD tasks, including image classification, semantic segmentation, and diffusion distillation, with minimal added computation. By treating learning as a curriculum from easy to hard regions, ADM bridges the gap between student and teacher representations and enhances feature richness and calibration across diverse settings.

Abstract

Online Knowledge Distillation (OKD) methods streamline the distillation training process into a single stage, removing the need for a pretrained teacher network from which knowledge is transferred to a more compact student network. This paper presents an approach that leverages intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two insights: (1) the features that are similar between students and teachers are predominantly focused on foreground objects; (2) teacher models emphasize foreground objects more than students do. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity to student models, which indicates that teacher models perform better in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when applied to offline knowledge distillation, semantic segmentation, and diffusion distillation tasks.
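
The asymmetric weighting described in the abstract can be illustrated with a small sketch. The exact forms of $\mathcal{L}_{co}$ and $\mathcal{L}_{di}$ are not given in this section, so everything below is an assumption made for illustration: $\mathcal{S}$ is taken to be a per-location cosine similarity rescaled to $[0, 1]$, the consensus weights are $\mathcal{S}$ itself, the divergence weights are $1 - \mathcal{S}$, and the per-location distance is a squared Euclidean distance. All function names are hypothetical, and plain Python lists stand in for feature maps.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two channel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def similarity_map(student_feats, teacher_feats):
    """Per-location similarity, rescaled from [-1, 1] to [0, 1] (assumed form of S)."""
    return [0.5 * (cosine_similarity(s, t) + 1.0)
            for s, t in zip(student_feats, teacher_feats)]


def adm_region_weights(student_feats, teacher_feats):
    """Return (S, consensus weights, divergence weights).

    In an autograd framework, S would be treated as a constant
    (gradient detach), so it only re-weights the losses.
    """
    sim = similarity_map(student_feats, teacher_feats)
    consensus_w = list(sim)                 # large where the two models agree
    divergence_w = [1.0 - s for s in sim]   # large where the two models disagree
    return sim, consensus_w, divergence_w


def squared_distance(u, v):
    """Squared Euclidean distance between two channel vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def weighted_mean(values, weights, eps=1e-8):
    """Weighted average of per-location values."""
    total = sum(w * v for w, v in zip(weights, values))
    return total / (sum(weights) + eps)


# Three spatial locations with two channels each (toy data).
student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

sim, consensus_w, divergence_w = adm_region_weights(student, teacher)
distances = [squared_distance(s, t) for s, t in zip(student, teacher)]

# Student-side term: emphasizes locations where the models already agree.
consensus_loss = weighted_mean(distances, consensus_w)
# Plain average over all locations, for comparison.
unweighted_loss = sum(distances) / len(distances)
```

On this toy input the similarity map is roughly 1, 0.5, and 0 for the three locations, so the consensus-weighted distance (about 0.67) is much smaller than the unweighted mean (2.0): the student term concentrates on regions of agreement, while the complementary weights in `divergence_w` single out the regions a teacher-side term would emphasize.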

Paper Structure

This paper contains 23 sections, 12 equations, 11 figures, 21 tables.

Figures (11)

  • Figure 1: Observation of Discrepancy Regions between Teacher and Student Features. The first row shows that, after vanilla training, discrepancy regions between teacher and student features are concentrated on foreground object regions. In contrast, the second row shows that, after ADM training, discrepancy regions are mainly found in background regions. Best viewed in color.
  • Figure 2: Analysis of the Teacher and Student Models' Intermediate Features. We load a pre-trained ResNet101 model and obtain its CAM as the Target Object Regions (dubbed TOR) on ImageNet. (a) mIoU of CAM regions. We calculate the mean intersection over union (mIoU) between TOR and the CAM of intermediate ResNet-34 and ResNet-18 models saved during the training process. We observe that ResNet-34 is more concentrated on object regions than ResNet-18. (b) mIoU of similar and discrepancy feature regions. We observe that the similar regions between ResNet-34 and ResNet-18 predominantly focus on TOR, and that a substantial proportion of discrepancy regions are also located within the TOR.
  • Figure 3: Asymmetric Decision-Making in Online Knowledge Distillation. GD denotes the Gradient Detach operation. Consensus Learning is applied to student models and Divergence Learning to teacher models. The gray arrow represents the baseline DML method, while the blue arrow denotes the newly added ADM component. Best viewed in color.
  • Figure 4: Visualization of the spatial-temporal evolution with high teacher-student similarity. Red indicates high similarity values. Best viewed in color.
  • Figure 5: Visualization. ADM (right) results in more discriminative features, smaller prediction discrepancies with teachers, and better calibration than DML (left).
  • ...and 6 more figures
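
The mIoU measurement mentioned in the Figure 2 caption can be sketched as follows. This is only a minimal illustration, not the authors' evaluation code: the 0.5 threshold used to turn a CAM into a binary region, the convention that two empty masks have IoU 1.0, and all function names are assumptions, and nested Python lists stand in for images.

```python
def binarize(cam, threshold=0.5):
    """Turn a CAM (values assumed in [0, 1]) into a binary region mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in cam]


def iou(mask_a, mask_b):
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (reference mask, model mask) pairs, one pair per image."""
    scores = [iou(ref, pred) for ref, pred in mask_pairs]
    return sum(scores) / len(scores)


# Toy example: a reference region (standing in for TOR) and a model CAM.
tor_mask = [[1, 1],
            [0, 0]]
model_cam = [[0.9, 0.2],
             [0.7, 0.1]]
model_mask = binarize(model_cam)

single_iou = iou(tor_mask, model_mask)
dataset_miou = mean_iou([(tor_mask, model_mask), (tor_mask, tor_mask)])
```

Here the thresholded CAM overlaps the reference region in one of three covered pixels, so `single_iou` is 1/3; averaging with a perfectly matching second pair gives a `dataset_miou` of 2/3.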