CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation

Jungsoo Lee, Debasmit Das, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, Fatih Porikli

TL;DR

This work addresses the gap between large vision foundation models (LVFMs) and edge models by presenting CustomKD, a two-stage knowledge distillation framework that first customizes LVFM features to the student via a shared head, and then distills knowledge from both the original and customized teacher features. By alternating feature customization and KD, and by employing both task-general and task-specific supervision, CustomKD overcomes the large discrepancy between teacher and student architectures and backbones without altering inference. Empirically, it achieves state-of-the-art or competitive results on unsupervised domain adaptation and semi-supervised learning, across diverse datasets and teacher backbones, while preserving edge-model efficiency. The method promises practical impact for deploying high-performing edge models in real-world settings by leveraging unlabeled data and LVFMs without additional inference costs.

Abstract

We propose a novel knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). Despite recent advancements in LVFMs, such as DINOv2 and CLIP, their potential in knowledge distillation for enhancing edge models remains underexplored. While knowledge distillation is a promising approach for improving the performance of edge models, the discrepancy in model capacity and the heterogeneous architectures of LVFMs and edge models pose a significant challenge. We observe that although using larger backbones (e.g., moving from ViT-S to ViT-L) in teacher models improves their downstream task performance, knowledge distillation from the large teacher models fails to bring as much performance gain for student models as for teacher models, due to the large model discrepancy. Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies. Specifically, beyond providing well-generalized original knowledge from teachers, CustomKD aligns the features of teachers to those of students, making them easier for students to understand and overcoming the large model discrepancy overall. CustomKD significantly improves the performance of edge models in scenarios with unlabeled data, such as unsupervised domain adaptation (e.g., OfficeHome and DomainNet) and semi-supervised learning (e.g., CIFAR-100 with 400 labeled samples and ImageNet with 1% labeled samples), achieving new state-of-the-art performance.

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 15 tables, 1 algorithm.

Figures (4)

  • Figure 1: Limited performance gain with larger teachers. While distilling from small teachers (e.g., ViT-S, ViT-B) brings student performance comparable to or better than the teacher's, existing KD methods fail to further improve the student's performance with large teachers (e.g., ViT-L). The conventional KD methods compared are FitNet, Soft Target, Logits, and Decoupled KD.
  • Figure 2: Overall framework of CustomKD. In the feature customization stage, we customize the well-generalized features of LVFMs to a given edge model using its head classifier ($\theta^c_s$). In the KD stage, we train the edge model to imitate 1) the task-general feature and 2) the customized task-specific feature from the teachers. We alternate these two stages every epoch throughout the training process.
  • Figure 3: t-SNE visualization of $\tilde{f}_t$ (red) and $f_s$ (blue) on OfficeHome. For each domain, the left and right plots show training with randomly initialized $\theta^c_t$ and $\theta^c_s$, respectively, for the head classifier.
  • Figure 4: Consistent performance gains of using CustomKD compared to FitNet across diverse teachers and backbone scales.
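
The two alternating stages described in the Figure 2 caption can be illustrated with a toy PyTorch sketch. This is not the authors' implementation: the linear stand-in encoders, the projection layers `teacher_proj` and `student_proj`, the MSE feature losses, the use of labels in the customization stage, and all sizes are assumptions made for illustration. Running both stages once per epoch is also just one reading of "alternate these two stages every epoch".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
D_IN, D_T, D_S, NUM_CLASSES = 20, 32, 16, 5  # toy sizes (assumptions)

# Stand-ins for the real encoders: a frozen "LVFM" teacher and a small student.
teacher_encoder = nn.Linear(D_IN, D_T).requires_grad_(False)
student_encoder = nn.Linear(D_IN, D_S)
student_head = nn.Linear(D_S, NUM_CLASSES)  # plays the role of theta^c_s
# Hypothetical projections (not named in the source):
teacher_proj = nn.Linear(D_T, D_S)  # f_t -> customized feature in student space
student_proj = nn.Linear(D_S, D_T)  # f_s -> teacher space for the task-general loss

def customization_step(x, y, opt):
    """Stage 1: adapt teacher features so the frozen student head classifies them."""
    with torch.no_grad():
        f_t = teacher_encoder(x)
    f_t_custom = teacher_proj(f_t)
    # Detached head weights: only teacher_proj receives gradients here.
    logits = F.linear(f_t_custom, student_head.weight.detach(),
                      student_head.bias.detach())
    loss = F.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def kd_step(x, opt):
    """Stage 2: student imitates task-general and customized teacher features."""
    with torch.no_grad():
        f_t = teacher_encoder(x)          # task-general feature
        f_t_custom = teacher_proj(f_t)    # customized task-specific feature
    f_s = student_encoder(x)
    loss = F.mse_loss(student_proj(f_s), f_t) + F.mse_loss(f_s, f_t_custom)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Random toy data: a labeled batch for customization, an unlabeled batch for KD.
x_lab, y_lab = torch.randn(64, D_IN), torch.randint(0, NUM_CLASSES, (64,))
x_unl = torch.randn(64, D_IN)

opt_custom = torch.optim.SGD(teacher_proj.parameters(), lr=0.1)
opt_kd = torch.optim.SGD(
    list(student_encoder.parameters()) + list(student_proj.parameters()), lr=0.1)

head_before = student_head.weight.detach().clone()
history = []
for epoch in range(4):  # both stages run once per epoch in this sketch
    ce = customization_step(x_lab, y_lab, opt_custom)
    kd = kd_step(x_unl, opt_kd)
    history.append((ce, kd))
```

In this sketch the student head is never updated, so the customized teacher features are always fitted to the same classifier. A real training loop would presumably also train the student on the labeled data with a supervised loss, which is omitted here. At inference time only `student_encoder` and `student_head` would be used, matching the summary's point that the method adds no inference cost.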