Training a Student Expert via Semi-Supervised Foundation Model Distillation

Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu

Abstract

Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and we instantiate it for instance segmentation, where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses the adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on both benchmarks.
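
To make the central loss concrete, below is a minimal PyTorch sketch of one plausible realization of the instance-aware pixel-wise contrastive loss. The tensor layout, the multiplicative fusion of mask and class scores, the top-k anchor selection, and the hard-negative mining are our assumptions inferred from the abstract and Figure 1, not the paper's verbatim formulation.

    import torch
    import torch.nn.functional as F

    def instance_aware_pixel_contrastive(z_weak, z_strong, mask_scores, cls_scores,
                                         inst_ids, tau=0.1, num_anchors=256, k_neg=64):
        """Sketch of an instance-aware pixel-wise InfoNCE loss (assumed form).

        z_weak, z_strong : (N, D) MLP-projected pixel embeddings from the
                           weak / strong views, flattened over one image
        mask_scores      : (N,) per-pixel mask confidence in [0, 1]
        cls_scores       : (N,) per-pixel class confidence in [0, 1]
        inst_ids         : (N,) pseudo-instance id per pixel, -1 for background
        """
        z_weak, z_strong = F.normalize(z_weak, dim=1), F.normalize(z_strong, dim=1)

        fg = inst_ids >= 0
        if fg.sum() < 2:                       # nothing to contrast against
            return z_weak.sum() * 0.0

        # Fuse mask and class scores (product rule is our assumption); the most
        # confident foreground pixels become anchors.
        fused = mask_scores * cls_scores * fg.float()
        anchors = torch.topk(fused, min(num_anchors, int(fg.sum()))).indices

        # Positive: the same pixel seen in the other view.
        pos = (z_weak[anchors] * z_strong[anchors]).sum(dim=1) / tau      # (A,)

        # Negatives: foreground pixels of *other* pseudo-instances; keep the
        # hardest k per anchor as the "informative" negatives.
        sim = z_weak[anchors] @ z_strong.t() / tau                        # (A, N)
        other = (inst_ids[anchors, None] != inst_ids[None, :]) & fg[None, :]
        neg = sim.masked_fill(~other, float('-inf'))
        hard_neg = neg.topk(min(k_neg, neg.size(1)), dim=1).values        # (A, k)

        logits = torch.cat([pos[:, None], hard_neg], dim=1)  # positive at index 0
        target = torch.zeros(len(anchors), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, target)

In the full pipeline this term would be weighted (the $\lambda_{\rm pxl}$ of Figure 3) and added to the task and distillation objectives in both the adaptation and distillation stages, since the abstract emphasizes maintaining the contrastive signal across both.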

Paper Structure

This paper contains 25 sections, 1 theorem, 24 equations, 9 figures, and 14 tables.

Key Result

Proposition 3.1

Under the sampling assumption (assump:sampling), one gradient update on $\mathcal{L}_{\rm pxl}$ increases the expected inter-instance margin $\Delta_{\rm emp}$. This holds even when pseudo-labels are imperfect, provided negatives are sampled using our instance-aware strategy. $\blacktriangleleft$
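
For context, $\mathcal{L}_{\rm pxl}$ and $\Delta_{\rm emp}$ are not defined in this summary. A plausible reconstruction, consistent with Figure 3's caption (which plots $\Delta_{\rm emp}$ as $\mathrm{NegMean}-\mathrm{PosMean}$) but not guaranteed to match the paper's exact definitions, is the pixel-wise InfoNCE form

    $\mathcal{L}_{\rm pxl} = -\frac{1}{|\mathcal{A}|}\sum_{i\in\mathcal{A}} \log\frac{\exp(\langle z_i, z_i^{+}\rangle/\tau)}{\exp(\langle z_i, z_i^{+}\rangle/\tau) + \sum_{j\in\mathcal{N}_i}\exp(\langle z_i, z_j^{-}\rangle/\tau)}$

where $\mathcal{A}$ is the anchor set, $z_i^{+}$ the cross-view positive, $\mathcal{N}_i$ the mined negatives, and $\tau$ a temperature, together with $\Delta_{\rm emp} = \mathrm{NegMean} - \mathrm{PosMean}$, the mean anchor-negative distance minus the mean anchor-positive distance. A gradient step on this loss raises $\langle z_i, z_i^{+}\rangle$ and lowers $\langle z_i, z_j^{-}\rangle$, which is the mechanism behind the expected margin growth.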

Figures (9)

  • Figure 1: Framework overview. Top: Three-stage pipeline: (1) adapt a pre-trained VFM teacher to the target domain via self-training with pixel-level contrastive calibration; (2) distill knowledge into a compact student using instance-aware contrastive sampling; (3) fine-tune the student on labeled data to correct residual pseudo-label bias. Bottom: Detailed view of stage (2): fused mask and class score maps produce anchor pixels, sampled across weak/strong views to form positive/negative pairs; an MLP projects features for the contrastive loss. Dashed arrows denote no gradient flow; red modules are trainable, blue are frozen.
  • Figure 2: Efficiency comparison (log scale).
  • Figure 3: Left: Empirical margin ($\mathrm{NegMean}-\mathrm{PosMean}$) measured every 10k iterations for different values of $\lambda_{\rm pxl}$. Center: False-negative rate ($\mathrm{FNR}$) for $\lambda_{\rm pxl}=0.1$, with the dashed line marking $p=0.5$. Right: Contrastive loss for $\lambda_{\rm pxl}=0.1$. (A sketch of these diagnostics appears after this list.)
  • Figure 4: Qualitative results on Cityscapes. Guided distillation (berrada2024guided; top) versus our method (bottom).
  • Figure 5: Qualitative bias reduction in stage-wise distillation. Top row: pseudo-labels generated by the adapted teacher. Bottom row: student predictions after distillation and refinement, showing reduced pseudo-label bias and sharper instance boundaries.
  • ...and 4 more figures
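
The quantities plotted in Figure 3 can be monitored with a few lines. The reconstruction below assumes the margin is measured on cosine distances (so that larger is better) and that a "false negative" is a sampled negative that actually shares the anchor's instance; the caption does not spell either out.

    import torch

    def margin_and_fnr(sim_pos, sim_neg, neg_same_instance):
        """Diagnostics in the style of Figure 3 (our reconstruction).

        sim_pos           : (A,) anchor-positive cosine similarities
        sim_neg           : (A, K) anchor-negative cosine similarities
        neg_same_instance : (A, K) bool, True when a sampled "negative"
                            actually belongs to the anchor's instance
        """
        pos_mean = (1.0 - sim_pos).mean()        # mean anchor-positive distance
        neg_mean = (1.0 - sim_neg).mean()        # mean anchor-negative distance
        margin = neg_mean - pos_mean             # NegMean - PosMean, as in Fig. 3
        fnr = neg_same_instance.float().mean()   # fraction of corrupted negatives
        return margin.item(), fnr.item()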

Theorems & Definitions (3)

  • Proposition 3.1: Expected Margin Growth
  • Proof Sketch (of Proposition 3.1)
  • Remark 8.1: Why $\langle z^+, z^-\rangle \approx 0$ holds
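
Remark 8.1's approximation is an instance of a standard high-dimensional fact, sketched here under the (assumed) model that projected embeddings of distinct instances behave like independent, uniformly distributed unit vectors in $\mathbb{R}^d$: then $\mathbb{E}[\langle z^+, z^-\rangle] = 0$ and $\operatorname{Var}[\langle z^+, z^-\rangle] = 1/d$, so the typical magnitude is $|\langle z^+, z^-\rangle| \approx 1/\sqrt{d}$. For a hypothetical projection dimension $d = 256$ this is about $0.06$, small enough to treat cross-pair terms as negligible in the margin analysis.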