CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu

TL;DR

CAST addresses the high cost of pixel-level instance segmentation by distilling large vision foundation models (VFMs) into compact experts through a three-stage semi-supervised knowledge distillation pipeline: teacher adaptation via self-training, knowledge transfer to the student, and student refinement on labeled data. An instance-aware pixel-wise contrastive loss is applied during both adaptation and distillation, so labeled and unlabeled images alike help sharpen masks and separate instances. On Cityscapes and ADE20K, the roughly 11x smaller student gains +8.5 and +7.1 AP over its zero-shot teacher(s) and +3.4 and +1.5 AP over the adapted teacher(s). The core contribution is the instance-aware contrastive signal, which strengthens inter-instance separation in low-label regimes, with potential impact on deployable vision systems in robotics and autonomous driving.

Abstract

Instance segmentation demands costly per-pixel annotations and computationally expensive models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero-shot VFM teacher(s) by +8.5 and +7.1 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and further outperforms state-of-the-art SSKD methods on both benchmarks.
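To make the idea of an instance-aware pixel-wise contrastive loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes an InfoNCE-style loss over cosine similarities, a hypothetical fusion rule (`fused_score`, the product of mask and class confidence), and a hypothetical negative-selection rule (keep the highest fused-score pixels from other instances). The weak/strong augmented views and the MLP projection head described in the paper are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def fused_score(mask_prob, class_prob):
    """Hypothetical fusion rule (an assumption): mask confidence x class confidence."""
    return mask_prob * class_prob

def pixel_contrastive_loss(pixels, tau=0.1, num_neg=2):
    """InfoNCE-style loss over a small set of pixels.

    Each pixel is a dict with 'emb' (list of floats), 'inst' (pseudo instance id),
    'mask_prob' and 'class_prob' (floats in [0, 1]). Positives share the anchor's
    instance id; negatives are the `num_neg` pixels from other instances with the
    highest fused score.
    """
    losses = []
    for i, anchor in enumerate(pixels):
        positives = [p for j, p in enumerate(pixels)
                     if j != i and p["inst"] == anchor["inst"]]
        negatives = [p for p in pixels if p["inst"] != anchor["inst"]]
        if not positives or not negatives:
            continue  # an anchor needs at least one positive and one negative
        negatives.sort(key=lambda p: fused_score(p["mask_prob"], p["class_prob"]),
                       reverse=True)
        neg_sum = sum(math.exp(cosine(anchor["emb"], n["emb"]) / tau)
                      for n in negatives[:num_neg])
        for pos in positives:
            pos_term = math.exp(cosine(anchor["emb"], pos["emb"]) / tau)
            losses.append(-math.log(pos_term / (pos_term + neg_sum)))
    return sum(losses) / len(losses) if losses else 0.0

def make_pixel(emb, inst, mask_prob=0.9, class_prob=0.9):
    """Helper for the toy example below (hypothetical data layout)."""
    return {"emb": emb, "inst": inst, "mask_prob": mask_prob, "class_prob": class_prob}

# Two instances whose embeddings are well separated ...
separated = [make_pixel([1.0, 0.0], 0), make_pixel([0.9, 0.1], 0),
             make_pixel([0.0, 1.0], 1), make_pixel([0.1, 0.9], 1)]
# ... versus two instances whose embeddings are entangled.
mixed = [make_pixel([1.0, 0.0], 0), make_pixel([0.0, 1.0], 0),
         make_pixel([1.0, 0.1], 1), make_pixel([0.1, 1.0], 1)]

loss_separated = pixel_contrastive_loss(separated)
loss_mixed = pixel_contrastive_loss(mixed)
```

In this toy setting the loss is small when same-instance pixels cluster together and large when instances are entangled, which is the behavior the inter-instance margin is meant to encourage.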

Paper Structure

This paper contains 20 sections, 1 theorem, 22 equations, 7 figures, 14 tables.

Key Result

Proposition 3.1

Under the sampling assumption, one gradient update on $\mathcal{L}_{\rm pxl}$ increases the expected inter-instance margin $\Delta_{\rm emp}$ by [...]. This expectation holds even when pseudo-labels are imperfect, provided negatives are sampled using our instance-aware strategy.
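The qualitative claim (a gradient step on a contrastive loss widens the margin) can be checked on a toy case. The sketch below is an illustration under simplifying assumptions, not the paper's proposition or proof: it uses a single anchor with one positive and one negative, unnormalized dot-product similarity, a fixed anchor, and a margin defined as positive similarity minus negative similarity. All function names and the parameters `tau` and `lr` are hypothetical. For this toy loss the margin grows by exactly 2 * lr * q * ||anchor||^2 / tau, where q is the softmax weight on the negative; this is not the expression missing from the statement above.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(anchor, pos, neg):
    """Toy margin (an assumption): positive similarity minus negative similarity."""
    return dot(anchor, pos) - dot(anchor, neg)

def info_nce_step(anchor, pos, neg, tau=0.5, lr=0.1):
    """One gradient-descent step on L = -log(e^{s+} / (e^{s+} + e^{s-})).

    Here s+ = <anchor, pos> / tau and s- = <anchor, neg> / tau. Only `pos` and
    `neg` are updated; the anchor is held fixed. Returns (new_pos, new_neg, q).
    """
    s_pos = dot(anchor, pos) / tau
    s_neg = dot(anchor, neg) / tau
    shift = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(s_pos - shift), math.exp(s_neg - shift)
    q = e_neg / (e_pos + e_neg)  # softmax weight on the negative
    # Gradients: dL/dpos = -(q / tau) * anchor, dL/dneg = +(q / tau) * anchor
    step = lr * q / tau
    new_pos = [p + step * a for p, a in zip(pos, anchor)]
    new_neg = [n - step * a for n, a in zip(neg, anchor)]
    return new_pos, new_neg, q

anchor = [1.0, 0.0]
pos = [0.6, 0.8]   # starts less similar to the anchor than the negative
neg = [0.8, 0.6]

before = margin(anchor, pos, neg)
new_pos, new_neg, q = info_nce_step(anchor, pos, neg, tau=0.5, lr=0.1)
after = margin(anchor, new_pos, new_neg)
predicted_gain = 2 * 0.1 * q * dot(anchor, anchor) / 0.5
```

The gain scales with q, so harder negatives (those the model currently confuses with the positive) contribute more to margin growth, which is consistent with the emphasis on informative negatives.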

Figures (7)

  • Figure 1: CAST framework overview. Top: Three-stage pipeline: (1) adapt a pre-trained VFM teacher to the target domain via self-training with pixel-level contrastive calibration; (2) distill knowledge into a compact student using instance-aware contrastive sampling; (3) fine-tune the student on labeled data to correct residual pseudo-label bias. Bottom: Detailed view of stage (2): fused mask and class score maps produce anchor pixels, sampled across weak/strong views to form positive/negative pairs; an MLP projects features for the contrastive loss. Dashed arrows denote no gradient flow; red modules are trainable, blue are frozen.
  • Figure 2: Efficiency comparison (log scale).
  • Figure 3: (Left) Empirical margin (NegMean–PosMean) every 10k iterations for various $\lambda_{\rm pxl}$. (Center) False negative rate ($\mathrm{FNR}$) for $\lambda_{\rm pxl}=0.1$, dashed at $p=0.5$. (Right) Contrastive loss for $\lambda_{\rm pxl}=0.1$.
  • Figure 4: Qualitative results on Cityscapes. Guided dist. [berrada2024guided] (top) vs. CAST (bottom).
  • Figure 5: Performance–complexity radar chart (normalized).
  • ...and 2 more figures

Theorems & Definitions (3)

  • Proposition 3.1: Expected Margin Growth
  • Proof: Proof Sketch
  • Remark 8.1: Why $\langle z^+, z^-\rangle \approx 0$ holds
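The body of Remark 8.1 is not included in this summary, so the paper's own argument is not reproduced here. One standard intuition that may be related is that independently drawn directions in a high-dimensional embedding space are nearly orthogonal. The sketch below illustrates only that general fact with random unit vectors; the function names, dimensions, and trial counts are hypothetical choices for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_inner_product(dim, trials=200, seed=0):
    """Average |<u, v>| over pairs of independent random unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = random_unit_vector(dim, rng)
        v = random_unit_vector(dim, rng)
        total += abs(sum(a * b for a, b in zip(u, v)))
    return total / trials

low_dim = mean_abs_inner_product(4)
high_dim = mean_abs_inner_product(256)
```

The average absolute inner product shrinks roughly like 1 / sqrt(dim), so unrelated embeddings in a few hundred dimensions have inner products close to zero. Learned embeddings are not random, so this is only a heuristic for why the approximation can be reasonable.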