On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models
Juliette Marrie, Michael Arbel, Julien Mairal, Diane Larlus
TL;DR
The paper studies how to transfer knowledge from large pretrained visual models to compact, task-specific students in resource-constrained settings. It proposes a two-step pipeline: first, probe a frozen teacher with a task head to obtain $f_t$; second, distill it into a smaller student $f_s$ with the combined loss $L(f_s) = (1-\alpha)L_{task}(f_s) + \alpha L_{distill}(f_s,f_t)$, where $L_{distill}$ is a KL-based objective with temperature $T$, computed over an augmented dataset $\mathcal{D}$ that includes synthetic images generated by Stable Diffusion via ImageMixer. Two findings stand out. First, a probed teacher, even when less accurate than a finetuned one, often yields a better student. Second, synthetic data, used only in $L_{distill}$, significantly improves performance across classification, fine-grained classification, and segmentation. These findings extend to different architectures and to teachers trained under different paradigms (e.g., EVA-02). The resulting guidelines avoid expensive teacher finetuning and rely on prompt-free, diffusion-based augmentation.
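To make the combined objective concrete, here is a minimal per-sample sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the KL direction is taken as teacher-to-student (the usual convention), no $T^2$ rescaling is applied, and a synthetic sample is represented by `label=None` so that it contributes only the distillation term, weighted by $\alpha$.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def task_loss(student_logits, label):
    """Cross-entropy of the student against a ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def distill_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: the KL direction and the absence of a T^2 factor are
    illustrative choices, not details confirmed by the summary above.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_t, p_s) if pt > 0.0)


def combined_loss(student_logits, teacher_logits, label, alpha, temperature):
    """(1 - alpha) * L_task + alpha * L_distill for one sample.

    Assumption: label=None marks a synthetic image, which has no
    ground-truth label and therefore only enters the distillation term.
    """
    distill = distill_loss(student_logits, teacher_logits, temperature)
    if label is None:
        return alpha * distill
    return (1.0 - alpha) * task_loss(student_logits, label) + alpha * distill


if __name__ == "__main__":
    student = [1.0, 0.5, -0.2]
    teacher = [2.0, 0.1, -1.0]
    # Real sample: both terms contribute.
    print(combined_loss(student, teacher, label=0, alpha=0.5, temperature=2.0))
    # Synthetic sample: distillation term only.
    print(combined_loss(student, teacher, label=None, alpha=0.5, temperature=2.0))
```

In a real training loop these quantities would be averaged over mini-batches and computed with an automatic-differentiation framework; the sketch only shows how the two terms combine and how unlabeled synthetic samples can be routed to $L_{distill}$ alone.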
Abstract
Large pretrained visual models exhibit remarkable generalization across diverse recognition tasks. Yet, real-world applications often demand compact models tailored to specific problems. Variants of knowledge distillation have been devised for such a purpose, enabling task-specific compact models (the students) to learn from a generic large pretrained one (the teacher). In this paper, we show that the excellent robustness and versatility of recent pretrained models challenge common practices established in the literature, calling for a new set of optimal guidelines for task-specific distillation. To address the lack of samples in downstream tasks, we also show that a variant of Mixup based on Stable Diffusion complements standard data augmentation. This strategy eliminates the need for engineered text prompts and improves distillation of generic models into streamlined specialized networks.
