MEDiC: Multi-objective Exploration of Distillation from CLIP

Konstantinos Georgiou, Maofeng Tang, Hairong Qi

Abstract

Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder. We conduct a systematic investigation of the design space surrounding this multi-objective framework. First, we show that all three objectives provide complementary information, with the full combination reaching 73.9% kNN accuracy on ImageNet-1K. Second, we introduce hierarchical clustering with relative position bias for evolved masking and find that, despite producing more semantically coherent masks than prior methods, evolved masking does not outperform simple block masking in the teacher-guided distillation setting, a finding we attribute to the teacher's inherent semantic awareness. Third, we reveal that optimal scalar loss weights are extremely fragile, with small perturbations causing drops of up to 17 percentage points in kNN accuracy. Our framework achieves 73.9% kNN and 85.1% fine-tuning accuracy with ViT-Base at 300 epochs.
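The abstract describes a combined objective with scalar weights on each term (the sweep in Figure 4 reports optima of DLW=0.01 for pixel reconstruction and CLW=0.30 for CLS alignment). The sketch below is a minimal, hedged illustration of how such a three-term loss could be combined; the exact per-term losses, normalization, and weighting scheme used by MEDiC are assumptions here, not the paper's implementation.

```python
import numpy as np

def medic_style_loss(student_patches, teacher_patches,
                     student_cls, teacher_cls,
                     pred_pixels, target_pixels,
                     clw=0.30, dlw=0.01):
    """Hypothetical sketch of a MEDiC-style three-objective loss.

    Assumptions (not from the paper): MSE for patch distillation and
    pixel reconstruction, negative cosine similarity for CLS alignment.
    The defaults clw=0.30 and dlw=0.01 match the optima reported in
    the paper's loss-weight sweep.
    """
    # Patch-level token distillation against frozen CLIP patch tokens.
    l_patch = np.mean((student_patches - teacher_patches) ** 2)

    # Global CLS alignment: 1 - cosine similarity with the teacher CLS token.
    cos = np.dot(student_cls, teacher_cls) / (
        np.linalg.norm(student_cls) * np.linalg.norm(teacher_cls) + 1e-8)
    l_cls = 1.0 - cos

    # Pixel reconstruction on masked patches (MAE-style target).
    l_pixel = np.mean((pred_pixels - target_pixels) ** 2)

    # Patch distillation acts as the primary term; the other two are
    # down-weighted by the global scalar weights.
    return l_patch + clw * l_cls + dlw * l_pixel
```

The sharp optima reported in the paper suggest that small changes to `clw` or `dlw` in a scheme like this can move kNN accuracy by double-digit margins, which is why the authors flag scalar loss weights as fragile.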

Paper Structure

This paper contains 23 sections, 10 equations, 7 figures, 13 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Four paradigms in masked image modeling. Top-left: raw-space pixel reconstruction (MAE-style). Top-right: latent-space prediction with discrete visual tokens (BEiT-style). Bottom-left: latent-space distillation at the patch level from a teacher (MaskDistill-style). Bottom-right: MEDiC combines pixel reconstruction with both patch-level and CLS-level distillation from a frozen CLIP teacher.
  • Figure 2: Masking strategies in MIM. (a-c) Three standard approaches: grid, random, and block masking with different mask ratios. (d) Evolved masking uses attention-guided clustering to produce semantically coherent mask patterns that adapt during training.
  • Figure 3: MEDiC achieves strong kNN and fine-tuning performance through multi-objective distillation from CLIP, outperforming methods that operate in either raw or latent space alone.
  • Figure 4: Loss weight sensitivity. (a) Pixel reconstruction weight (DLW) has a sharp optimum at 0.01; higher values degrade kNN by up to 17 points. (b) CLS alignment weight (CLW) peaks at 0.30 with a sudden drop at 0.20. Both curves reveal the fragility of global scalar weights. The combined optimum (DLW=0.01, CLW=0.30) yields 73.92% kNN and 85.07% fine-tuning. Full sweep data in the Appendix.
  • Figure 5: Evolved masking across training epochs. For each input image, the top row shows EM-based masks and the bottom row shows HC-based masks. HC produces more spatially coherent groupings that align with semantic content.
  • ...and 2 more figures