On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models

Juliette Marrie, Michael Arbel, Julien Mairal, Diane Larlus

TL;DR

The paper studies how to transfer knowledge from large pretrained visual models to compact, task-specific students for resource-constrained settings. It proposes a two-step pipeline: first probe a frozen teacher with a task head to form $f_t$, then distill it into a smaller student $f_s$ with the combined loss $L(f_s) = (1-\alpha)L_{task}(f_s) + \alpha L_{distill}(f_s,f_t)$, where $\alpha$ weights the two terms and $L_{distill}$ is a KL-based objective with temperature $T$. The distillation term is evaluated over an augmented dataset $\mathcal{D}$ that includes synthetic images generated with Stable Diffusion via ImageMixer. Two findings stand out. First, a probed teacher, even when less accurate than a finetuned one, often yields a better student. Second, synthetic data used only in $L_{distill}$ significantly improves performance across classification, fine-grained classification, and segmentation. The findings extend to different architectures and to models trained with different paradigms (e.g., EVA-02), yielding practical guidelines that avoid expensive teacher finetuning and rely on prompt-free diffusion-based augmentation.
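
The combined loss above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy `student` and `teacher` callables, the default values of `alpha` and `temperature`, and the KL direction (teacher to student, as is common in knowledge distillation) are all assumptions made for illustration. Any rescaling of the distillation term by the temperature is omitted because the summary does not specify one.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]

def cross_entropy(student_logits, label):
    """Task loss for one labeled sample (assumed here to be cross-entropy)."""
    return -log_softmax(student_logits)[label]

def kl_distill(student_logits, teacher_logits, temperature):
    """KL(teacher || student) between temperature-softened distributions."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

def combined_loss(real_batch, synthetic_images, student, teacher,
                  alpha=0.5, temperature=2.0):
    """(1 - alpha) * L_task + alpha * L_distill.

    real_batch:       list of (x, y) pairs from the original dataset
    synthetic_images: list of unlabeled synthetic inputs x'
    student, teacher: hypothetical callables mapping an input to logits
    """
    # The task loss only sees labeled, original samples.
    task = sum(cross_entropy(student(x), y) for x, y in real_batch) / len(real_batch)
    # The distillation loss sees original and synthetic inputs alike.
    distill_inputs = [x for x, _ in real_batch] + list(synthetic_images)
    distill = sum(
        kl_distill(student(x), teacher(x), temperature) for x in distill_inputs
    ) / len(distill_inputs)
    return (1 - alpha) * task + alpha * distill

# Toy usage: inputs are plain numbers and the "models" are fixed functions.
teacher = lambda x: [2.0 + x, 0.0, -1.0]
student = lambda x: [1.0 + x, 0.5, -0.5]
loss = combined_loss([(0.0, 0), (1.0, 0)], [0.5], student, teacher)
print(f"combined loss: {loss:.4f}")
```

Because synthetic images enter only through `distill_inputs`, they never need labels, which matches the statement that synthetic data is used only in $L_{distill}$.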

Abstract

Large pretrained visual models exhibit remarkable generalization across diverse recognition tasks. Yet, real-world applications often demand compact models tailored to specific problems. Variants of knowledge distillation have been devised for such a purpose, enabling task-specific compact models (the students) to learn from a generic large pretrained one (the teacher). In this paper, we show that the excellent robustness and versatility of recent pretrained models challenge common practices established in the literature, calling for a new set of optimal guidelines for task-specific distillation. To address the lack of samples in downstream tasks, we also show that a variant of Mixup based on stable diffusion complements standard data augmentation. This strategy eliminates the need for engineered text prompts and improves distillation of generic models into streamlined specialized networks.

Paper Structure (28 sections, 4 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: This paper advocates for distilling a large pretrained teacher (top, left) to train a small task-specific student model (top, right). This distillation process results in a better clustering of the representations than simply finetuning the student on the task (bottom, right). Distillation is improved by a class-agnostic data augmentation based on Stable Diffusion that mixes real images to create synthetic ones, producing the features shown in gray in the teacher plot. Each plot shows image features for 30 classes of the CUB Bird dataset, after PCA (one color per class).
  • Figure 2: Overview of the task-specific distillation pipeline. The pretrained model is probed to build a teacher (top). Then its knowledge is distilled [hinton2015distilling] by minimizing the distillation loss $\mathcal{L}_{\text{distill}}$ jointly with the task loss $\mathcal{L}_{\text{task}}$ (bottom). $\mathcal{L}_{\text{distill}}$ is optimized with both i) original images $x$ and ii) synthetic images $x'$ obtained with Stable Diffusion, while $\mathcal{L}_{\text{task}}$ is only optimized on the original dataset $(x,y)$. Note that $x$ and $x'$ are also transformed using standard data augmentation (not shown here).
  • Figure 3: Diffusion-based data augmentation. Examples of synthetic images generated using ImageMixer [imagemixer] as described in the section on diffusion-based data augmentation (sec:da_sd), mixing two training images from CUB [WahCUB_200_2011] (left), Pascal VOC [everingham10voc] (middle) and Painting from DomainNet [peng2019domainnet] (right). Those populate the extended dataset $\mathcal{D}_{\text{sd}}$ for distillation.
  • Figure 4: PCA of patch embedding representations for 20 classes of ADE20K for the ViT-g teacher (a) and for the ViT-S student in its initial state (b), after finetuning (c) and after distillation (d), colored by their main class (details in the appendix). Classes are better clustered after distillation than after finetuning.
  • Figure A: Diffusion-based data augmentation. Examples of synthetic images generated using ImageMixer [imagemixer] as described in the section on diffusion-based data augmentation (sec:da_sd), mixing two training images from ADE20K [zhou2017ade20k] (left), Sketch from DomainNet [peng2019domainnet] (middle) and DTD [cimpoi14dtd] (right). Those populate the extended dataset $\mathcal{D}_{\text{sd}}$ used for distillation.
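
Read together, the captions of Figures 2, 3 and A suggest which data each loss term sees. The expression below is one way to write that down, offered as an interpretation rather than the paper's own equation. The per-sample losses $\ell_{\text{task}}$ and $\ell_{\text{distill}}$, the symbol $\tilde{x}$, the mixing weight $\alpha$, and the reading of $\mathcal{D}_{\text{sd}}$ as containing both the original images $x$ and the synthetic images $x'$ are notational assumptions introduced here.

```latex
\mathcal{L}(f_s)
  = (1-\alpha)\,
    \mathbb{E}_{(x,y)\in\mathcal{D}}
      \big[\ell_{\text{task}}\big(f_s(x),\, y\big)\big]
  + \alpha\,
    \mathbb{E}_{\tilde{x}\in\mathcal{D}_{\text{sd}}}
      \big[\ell_{\text{distill}}\big(f_s(\tilde{x}),\, f_t(\tilde{x})\big)\big]
```

Under this reading, only the first expectation requires labels $y$, so the synthetic images can be used without any annotation.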