FerKD: Surgical Label Adaptation for Efficient Distillation

Zhiqiang Shen

TL;DR

This work introduces a stabilized SelfMix augmentation that weakens the variation of the mixed images and corresponding soft labels through mixing similar regions within the same image, and demonstrates empirically that this method can dramatically improve the convergence speed and final accuracy.

Abstract

We present FerKD, a novel, efficient knowledge distillation framework that incorporates partial soft-hard label adaptation coupled with a region-calibration mechanism. Our approach stems from the observation and intuition that standard data augmentations, such as RandomResizedCrop, tend to transform inputs into diverse conditions: easy positives, hard positives, or hard negatives. In traditional distillation frameworks, these transformed samples are used equally through their predictive probabilities derived from pretrained teacher models. However, relying solely on prediction values from a pretrained teacher, a common practice in prior studies, neglects the reliability of these soft label predictions. To address this, we propose a new scheme that calibrates the less-confident regions to be the context using softened hard ground-truth labels. Our approach involves hard region mining and calibration. We demonstrate empirically that this method can dramatically improve the convergence speed and final accuracy. Additionally, we find that a consistent mixing strategy can stabilize the distributions of soft supervision, taking advantage of the soft labels. As a result, we introduce a stabilized SelfMix augmentation that weakens the variation of the mixed images and corresponding soft labels by mixing similar regions within the same image. FerKD is an intuitive and well-designed learning system that eliminates several heuristics and hyperparameters of the former FKD solution. More importantly, it achieves remarkable improvement on ImageNet-1K and downstream tasks. For instance, FerKD achieves 81.2% on ImageNet-1K with ResNet-50, outperforming FKD and FunMatch by remarkable margins. Leveraging better pretrained weights and larger architectures, our finetuned ViT-G14 even achieves 89.9%. Our code is available at https://github.com/szq0214/FKD/tree/main/FerKD.
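
As a rough illustration of the calibration idea in the abstract, the sketch below keeps the teacher's soft label for a crop when the teacher is reasonably confident, and otherwise substitutes a label-smoothed version of the hard ground-truth label. This is only one plausible reading of the abstract, not the paper's exact rule. The function names, the `low` threshold, and the `smoothing` value are all assumptions made for illustration; the 0.3 default merely echoes the lower bound mentioned in the Figure 5 caption.

```python
def soften_hard_label(target, num_classes, smoothing=0.1):
    """Label-smoothed one-hot vector for the ground-truth class.

    `smoothing` is a hypothetical value, not taken from the paper.
    """
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if c == target else off_value
            for c in range(num_classes)]


def calibrate_region(soft_label, target, low=0.3, smoothing=0.1):
    """Choose the supervision signal for one cropped region.

    soft_label: teacher probabilities for the crop (sums to 1).
    target:     ground-truth class index of the full image.
    low:        assumed confidence threshold (hypothetical).
    """
    if max(soft_label) >= low:
        # Teacher is confident enough: keep its soft label unchanged.
        return list(soft_label)
    # Less-confident region: fall back to a softened hard label.
    return soften_hard_label(target, len(soft_label), smoothing)


if __name__ == "__main__":
    confident = [0.1, 0.6, 0.2, 0.1]
    uncertain = [0.25, 0.25, 0.25, 0.25]
    print(calibrate_region(confident, target=1))
    print(calibrate_region(uncertain, target=2))
```

In this sketch the decision depends only on the teacher's maximum probability; the actual method may distinguish more region types or treat some regions differently.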

Paper Structure (18 sections, 4 equations, 10 figures, 14 tables)

Figures (10)

  • Figure 1: Illustration of the motivation for FerKD. The left figure depicts the original input, and the middle figure shows the center points of bounding boxes generated using RandomResizedCrop. The radius of each circle corresponds to the area of the bounding box. The center points of the bounding boxes are concentrated in the center of the image, and their area increases as they approach the center. The right figure displays several of the most and least confident bounding boxes and their corresponding predictive probabilities from a pretrained teacher or teacher ensemble. The proposed hard region calibration strategy is based on these predictions.
  • Figure 2: Statistics of soft label max-probability for crops on ImageNet-1K. The soft labels are from FKD [shen2022fast]. For each image, 500 regions are randomly cropped.
  • Figure 3: Illustration of region calibration according to their predictive probabilities in FerKD. Left is the input image with RandomResizedCrop. One bounding box has high probability and another has low probability.
  • Figure 4: Illustration of region calibration according to their predictive probabilities in FerKD. Left is the input image with RandomResizedCrop. Right is the rule for calibrating the probabilities of regions.
  • Figure 5: Minimal and maximal probability. The upper figure indicates that only regions having the max probability in $[\text{minimal}, 1.0]$ will be trained, and baseline indicates that the model is trained with all randomly sampled regions. The bottom figure indicates that only regions having the max probability in $[0.3, \text{maximal}]$ will be trained, and baseline indicates that the model is trained with regions in $[0.3, 1.0]$.
  • ...and 5 more figures
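
The crop geometry described in the Figure 1 caption can be explored with a short sketch. The code below samples boxes in the style of RandomResizedCrop (random area fraction, log-uniform aspect ratio, uniform placement) and reports each box's area fraction and the distance of its center from the image center. The function names and the default `scale` and `ratio` ranges are assumptions chosen to mirror common defaults, and the fallback branch is a simplification. Because a box of width `w` can only be placed with its left edge in `[0, width - w]`, larger boxes are necessarily centered closer to the image center, which is consistent with the pattern the caption describes.

```python
import math
import random


def sample_crop(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    """Sample one crop box (left, top, w, h) in the style of RandomResizedCrop."""
    area = width * height
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(10):
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            left = rng.randint(0, width - w)
            top = rng.randint(0, height - h)
            return left, top, w, h
    # Simplified fallback (assumption): use the whole image.
    return 0, 0, width, height


def crop_stats(box, width, height):
    """Return (area fraction, distance of box center from image center)."""
    left, top, w, h = box
    cx, cy = left + w / 2, top + h / 2
    area_frac = (w * h) / (width * height)
    offset = math.hypot(cx - width / 2, cy - height / 2)
    return area_frac, offset


if __name__ == "__main__":
    rng = random.Random(0)
    W, H = 500, 375
    stats = [crop_stats(sample_crop(W, H, rng=rng), W, H) for _ in range(500)]
    small = [off for frac, off in stats if frac < 0.2]
    large = [off for frac, off in stats if frac > 0.5]
    if small and large:
        print("mean center offset, small crops:", sum(small) / len(small))
        print("mean center offset, large crops:", sum(large) / len(large))
```

Plotting the sampled centers with a radius proportional to `area_frac` would give a picture similar in spirit to the middle panel the caption describes.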