DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

Zhengxu He, Jun Li, Zhijian Wu

Abstract

Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation can transfer VLM capacity to lightweight classifiers, conventional distillation mechanisms, which transfer directly from a generic VLM to a compact student, often yield suboptimal results because of severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT), which facilitates adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves performance gains of 12.63% and 8.34% on the FGVC-Aircraft and CUB-200-2011 datasets, respectively, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.
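
The two-stage transfer described in the abstract can be sketched as a pair of loss functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss forms, so cross-entropy for the task supervision, mean squared error for the feature-level distillation, the extra cross-entropy term on the student, and the weight `lam` are all hypothetical choices, and every function name below is invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the true class."""
    return -math.log(softmax(logits)[label])


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between two feature vectors of equal length."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature dimensions must match")
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n


def intermediate_teacher_loss(teacher_logits, label):
    """Stage 1 (assumed form): the trainable intermediate teacher sits on top
    of frozen VLM features and is supervised by the fine-grained task labels."""
    return cross_entropy(teacher_logits, label)


def student_loss(student_logits, student_feat, teacher_feat, label, lam=1.0):
    """Stage 2 (assumed form): the student matches the intermediate teacher's
    refined features (feature-level distillation) while also fitting labels.
    `lam` is a hypothetical weighting factor, not a value from the paper."""
    task_term = cross_entropy(student_logits, label)
    distill_term = feature_mse(student_feat, teacher_feat)
    return task_term + lam * distill_term
```

In this sketch the VLM never appears in the student objective: the student only sees the intermediate teacher's features, which reflects the abstract's point that the teacher, rather than the generic VLM, supplies the task-aligned supervision.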

Paper Structure

This paper contains 25 sections, 14 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Failure cases of direct distillation from a VLM to the lightweight student network ResNet-18. (a) Direct distillation from the VLM. (b) Independently trained student without teacher guidance. (c) Our method: DAIT. Heatmaps are overlaid on the input images to indicate model attention. Compared to (a) and (b), our method yields more concentrated activations on discriminative parts (e.g., the bird's body) while suppressing background responses.
  • Figure 2: Overview of our proposed DAIT for distilling from a frozen VLM to a lightweight model via a trainable intermediate teacher model. With the help of data augmentation, the intermediate teacher captures rich fine-grained cues from the VLM outputs and performs adaptive knowledge transfer. It then processes original image features to produce refined and task-aligned supervision, which is transferred to the lightweight student by feature-level distillation.
  • Figure 3: Visualization of intra-class and inter-class feature distributions of ResNet-18 distilled by different methods on the Stanford Cars dataset. They are illustrated via similarity matrices and t-SNE plots. With the guidance of an intermediate teacher, our method enables ResNet-18 to attain stronger discriminative capability.
  • Figure 4: Bubble visualization of parameter counts for different deep architectures. The number below each bubble denotes the model size in millions of parameters (M). Unlike prior works, which usually use a network larger than the student as a teacher assistant, DAIT adopts a model even smaller than the student as the intermediate teacher.
  • Figure 5: Top-1 accuracy of ResNet-18 on FGVC-Aircraft when different intermediate teachers are used for knowledge distillation. The dashed line represents the baseline result (55.87%) obtained by direct distillation from the VLM.
  • ...and 2 more figures