
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

Chonghua Lv, Dong Zhao, Shuang Wang, Dou Quan, Ning Huyan, Nicu Sebe, Zhun Zhong

TL;DR

This work proposes Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization in knowledge distillation and introduces a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs.

Abstract

Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at https://github.com/Younger-hua/GKD.
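The query-based soft distillation idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the scaled dot-product attention, the `temperature` parameter, the mean-squared-error loss, and the function names `query_teacher` and `soft_distillation_loss` are all assumptions made for illustration. The sketch also assumes student and teacher tokens have already been projected to the same feature dimension.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_teacher(student, teacher, temperature=1.0):
    """Use each student token as a query over the teacher tokens.

    Assumption: scaled dot-product attention. Each returned row is a
    softmax-weighted (convex) combination of teacher tokens.
    """
    d = len(teacher[0])
    scale = temperature * math.sqrt(d)
    retrieved = []
    for q in student:
        weights = softmax([dot(q, k) / scale for k in teacher])
        retrieved.append(
            [sum(w * k[j] for w, k in zip(weights, teacher)) for j in range(d)]
        )
    return retrieved


def soft_distillation_loss(student, teacher, temperature=1.0):
    """Assumed loss: mean squared error between each student token and
    the teacher content it retrieved."""
    targets = query_teacher(student, teacher, temperature)
    n = len(student) * len(student[0])
    return sum(
        (s - t) ** 2
        for s_row, t_row in zip(student, targets)
        for s, t in zip(s_row, t_row)
    ) / n


# Toy example: two teacher tokens, two student tokens, feature dimension 2.
teacher_tokens = [[1.0, 0.0], [0.0, 1.0]]
student_tokens = [[0.9, 0.1], [0.2, 0.8]]
loss = soft_distillation_loss(student_tokens, teacher_tokens)
```

Because the student picks its own targets through the attention weights, the supervision is a soft mixture of teacher tokens rather than a fixed position-by-position match, which is one plausible reading of "selectively retrieve" in the abstract.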

Paper Structure (13 sections, 12 equations, 7 figures, 5 tables)

This paper contains 13 sections, 12 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison of Knowledge Distillation (KD) and our proposed generalizable KD (GKD). Conventional KD preserves accuracy within the same domain but overlooks generalization to unseen domains.
  • Figure 2: Generalization comparison of KD, its enhanced variants (CWD, Af-DCD), and our GKD. GKD consistently outperforms existing KD methods on unseen domains.
  • Figure 3: (a) Conventional KD methods yield limited performance gains on unseen domains, whereas two-stage KD effectively improves the generalization performance of the student. (b) Loss curves of various KD methods with DINOv2-B $\to$ ViT-S. Conventional single-stage KD causes oscillations and slower convergence, while two-stage KD exhibits smoother loss decay, indicating more stable optimization.
  • Figure 4: Overview of the proposed GKD framework. GKD comprises two major parts: domain-general distillation and task learning. In the domain-general distillation stage, the student sequentially performs task-agnostic and domain-agnostic distillation, both via the Query-based Soft Distillation mechanism. In the task learning stage, only the decoder is trained on source annotations, while the student encoder is frozen to preserve the domain-general representations.
  • Figure 5: PCA visualization. Feature embeddings are extracted from the last layer of the encoder. GKD effectively distills the spatial structure information of VFMs.
  • ...and 2 more figures
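
The stage ordering described in the Figure 4 caption can be written down as a small schedule. This is a hedged sketch rather than the authors' training code: the `Stage` class, the module names, and the objective labels are hypothetical. It also assumes that the teacher stays frozen throughout and that the decoder is not trained during the distillation stages, neither of which the caption states explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trains: tuple    # modules updated in this stage
    frozen: tuple    # modules kept fixed in this stage
    objective: str   # label only; the loss itself is not defined here


# Order follows the Figure 4 caption: task-agnostic distillation, then
# domain-agnostic distillation, then task learning with a frozen encoder.
GKD_SCHEDULE = (
    Stage("task_agnostic_distillation", ("student_encoder",), ("teacher",),
          "query_based_soft_distillation"),
    Stage("domain_agnostic_distillation", ("student_encoder",), ("teacher",),
          "query_based_soft_distillation"),
    Stage("task_learning", ("decoder",), ("student_encoder",),
          "segmentation_loss_on_source_annotations"),
)


def trainable_modules(stage_name):
    """Return the modules that receive gradient updates in a given stage."""
    for stage in GKD_SCHEDULE:
        if stage.name == stage_name:
            return list(stage.trains)
    raise KeyError(stage_name)


for stage in GKD_SCHEDULE:
    print(stage.name, "trains", trainable_modules(stage.name))
```

Keeping the encoder out of the `trains` tuple in the last stage is the point of the decoupling: the task loss on source annotations cannot overwrite the representations learned during distillation.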