Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Ziming Wang, Shumin Han, Xiaodi Wang, Jing Hao, Xianbin Cao, Baochang Zhang

TL;DR

This work addresses transferring knowledge from large Masked Image Modeling (MIM) transformers to compact CNNs, a challenging problem due to masked inputs and data-distribution shifts. It introduces Heterogeneous Generative Knowledge Distillation (H-GKD), a bridge that enables a sparse CNN student with a UNet-like decoder to learn the visual representations and data distribution encoded by a heterogeneous MIM teacher. The learning objective combines a feature imitation term $\mathbf{L}_{feat}$ and a similarity-distribution term $\mathbf{L}_{sim}$ computed via a memory queue with Pearson correlation, formalized as $\hat{\theta}=\arg\min (\mathbf{L}_{sim}+\mathbf{L}_{feat})$. Empirically, H-GKD delivers state-of-the-art results on ImageNet-1K and MS COCO for classification, detection, and segmentation, exemplified by improving a sparse ResNet-50 from $76.98\%$ to $80.01\%$ Top-1 accuracy on ImageNet under certain settings.
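
The combined objective $\mathbf{L}_{sim}+\mathbf{L}_{feat}$ can be illustrated with a minimal sketch. Several details here are assumptions made for illustration and are not taken from the paper: features are plain vectors, similarities to the memory queue are cosine similarities, $\mathbf{L}_{sim}$ is taken as one minus the Pearson correlation of the two similarity vectors, and the two terms are summed with equal weight. The function names (`hgkd_loss`, `pearson`, and so on) are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors (L_feat)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    """Cosine similarity; an assumed choice of similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    std_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    std_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (std_u * std_v)

def hgkd_loss(student_feat, teacher_feat, queue):
    """Return (L_sim + L_feat, L_sim, L_feat) for one sample.

    Assumption: L_sim = 1 - Pearson(similarities of student to queue,
    similarities of teacher to queue).
    """
    l_feat = mse(student_feat, teacher_feat)
    sim_student = [cosine(student_feat, q) for q in queue]
    sim_teacher = [cosine(teacher_feat, q) for q in queue]
    l_sim = 1.0 - pearson(sim_student, sim_teacher)
    return l_sim + l_feat, l_sim, l_feat

# Toy memory queue of three stored feature vectors (made-up values).
queue = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 1.0]]
teacher = [0.9, 0.1, 0.3]

# A student that matches the teacher exactly incurs (near) zero loss.
total_same, l_sim_same, l_feat_same = hgkd_loss(teacher, teacher, queue)
# A student with a different feature vector incurs a larger loss.
total_diff, l_sim_diff, l_feat_diff = hgkd_loss([0.1, 0.9, 0.3], teacher, queue)
```

In this toy setup the loss vanishes when the student reproduces the teacher's feature and grows as the student's feature, and hence its similarity profile over the queue, drifts away from the teacher's.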

Abstract

Small CNN-based models usually require knowledge transferred from a large model before they are deployed on resource-limited edge devices. Masked image modeling (MIM) methods have achieved great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models, mainly because of the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. It is a simple yet effective paradigm for learning the visual representation and data distribution from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, on the ImageNet-1K dataset, H-GKD improves the accuracy of ResNet-50 (sparse) from 76.98% to 80.01%.

Paper Structure (17 sections, 8 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: The distribution shift. The panels show pixel-intensity histograms for the student and the teacher. (b) When masked images are fed directly to CNNs (such as ResNet [he2016deep] or MobileNet [howard2017mobilenets]), the data distribution shifts substantially (shown in green). (a) In contrast, Transformer-based generative methods (such as MAE [he2022masked] or BEiT [bao2021beit]) have no such side effect because they can process variable-length input (shown in blue).
  • Figure 2: The structure of H-GKD. We first pre-train a MIM model, which could be MAE [he2022masked] or BEiT [bao2021beit], as the teacher model. A sparse CNN-based student network is then trained to learn the teacher's visual representation by minimizing the loss (see the Methods subsection on similarity).
  • Figure 3: Sparse convolution for the masked image. With traditional convolution (top), a kernel centered at a zero pixel can still produce a non-zero result if any non-zero pixel falls inside the kernel window. After repeated convolutions the masked region is eroded and the visible region expands. Sparse convolution instead skips masked pixels and thus preserves the mask pattern.
  • Figure 4: The pipeline of our distillation method. The teacher model is pre-trained and frozen during distillation. The student model is trained with 1) the MSE loss between the features of the teacher and the student and 2) the Pearson coefficient loss between the similarity distributions of the teacher and the student.
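
The mask-erosion effect described in the Figure 3 caption can be reproduced with a small sketch. It is an illustration under stated assumptions rather than the paper's implementation: sparse convolution is emulated by running a dense 3x3 convolution on the masked input and then re-applying the mask to the output, and the helper names (`conv3x3`, `sparse_conv3x3`, `support`) are hypothetical.

```python
def conv3x3(img, kernel):
    """Dense 3x3 'same' convolution (cross-correlation) with zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * img[y][x]
            out[i][j] = acc
    return out

def apply_mask(feat, mask):
    """Zero out every position where mask == 0."""
    return [[v * m for v, m in zip(row, mrow)] for row, mrow in zip(feat, mask)]

def sparse_conv3x3(img, mask, kernel):
    """Emulated sparse convolution (assumption): outputs exist only at
    visible positions (mask == 1); masked positions are skipped."""
    return apply_mask(conv3x3(apply_mask(img, mask), kernel), mask)

def support(feat):
    """Binary map of non-zero positions."""
    return [[1 if v != 0 else 0 for v in row] for row in feat]

# 4x4 toy image: left two columns visible (1), right two columns masked (0).
mask = [[1, 1, 0, 0] for _ in range(4)]
img = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]

dense_out = conv3x3(apply_mask(img, mask), kernel)
sparse_out = sparse_conv3x3(img, mask, kernel)

# Dense convolution leaks values into the first masked column;
# the emulated sparse convolution keeps the original mask pattern.
dense_support = support(dense_out)
sparse_support = support(sparse_out)
```

After one dense layer the non-zero region has already grown by one column into the masked area, whereas the emulated sparse layer leaves the visible/masked layout unchanged, which is the property the caption attributes to sparse convolution.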