DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

Zhengxu He; Jun Li; Zhijian Wu

Abstract

Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation helps transfer VLM capacity to lightweight classifiers, conventional distillation mechanisms, which transfer directly from a generic VLM to a compact student, often yield suboptimal results because of severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT), which facilitates adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves performance gains of 12.63% and 8.34% on the FGVC-Aircraft and CUB-200-2011 datasets, respectively, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.
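
The two-stage structure described in the abstract can be illustrated with a minimal sketch. It shows only how each stage might combine a task classification term with a feature alignment term. The specific loss forms (cosine distance in stage 1, mean squared error in stage 2), the function names, and the `weight` parameter are assumptions made for illustration and are not taken from the paper, whose actual losses are only named in the outline.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def cosine_distance(u, v):
    """1 - cosine similarity; a stand-in alignment term (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mse(u, v):
    """Mean squared error; a stand-in feature-matching term (assumption)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def stage1_loss(teacher_feat, teacher_logits, vlm_feat, label, weight=1.0):
    """Stage 1 (hypothetical form): the trainable intermediate teacher is
    supervised by the target task and aligned to frozen VLM features."""
    return (cross_entropy(teacher_logits, label)
            + weight * cosine_distance(teacher_feat, vlm_feat))

def stage2_loss(student_feat, student_logits, teacher_feat, label, weight=1.0):
    """Stage 2 (hypothetical form): the lightweight student is supervised by
    the target task and matched to the intermediate teacher's features."""
    return (cross_entropy(student_logits, label)
            + weight * mse(student_feat, teacher_feat))
```

In this sketch the VLM features enter only stage 1 and the student sees only the intermediate teacher, which mirrors the indirect transfer path the abstract describes; the relative weighting of the terms is left as a free parameter.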

Paper Structure (25 sections, 14 equations, 7 figures, 3 tables)

Introduction
Related Work
Fine-Grained Visual Categorization
Knowledge Distillation
Method
Distillation from VLM to Intermediate Teacher
Semantic Image Alignment Loss.
Image Representation Alignment Loss.
Classification Loss.
Overall Loss.
Distillation from Intermediate Teacher to Lightweight Model
Spatial Representation Alignment Loss.
Classification Loss.
Overall Loss.
Comparison with Prior Works
...and 10 more sections

Figures (7)

Figure 1: Failure cases of direct distillation from VLM to lightweight student network ResNet18. (a) Direct distillation from the VLM. (b) Independently trained student without teacher guidance. (c) Our method: DAIT. Heatmaps are overlaid on the input images to indicate model attention. Compared to (a) and (b), our method yields more concentrated activations on discriminative parts (e.g., the bird's body) while suppressing background responses.
Figure 2: Overview of our proposed DAIT for distilling from a frozen VLM to a lightweight model via a trainable intermediate teacher model. With the help of data augmentation, the intermediate teacher captures rich fine-grained cues from the VLM outputs and performs adaptive knowledge transfer. It then processes original image features to produce refined and task-aligned supervision, which is transferred to the lightweight student by feature-level distillation.
Figure 3: Visualization of intra-class and inter-class feature distributions of ResNet-18 distilled by different methods on the Stanford Cars dataset. They are illustrated via similarity matrices and t-SNE plots. With the guidance of an intermediate teacher, our method enables ResNet-18 to attain stronger discriminative capability.
Figure 4: Bubble visualization of parameter counts for different deep architectures. The number below each bubble denotes the model size in millions (M) of parameters. Unlike prior works, which usually use a network larger than the student as a teacher assistant, our DAIT adopts an intermediate teacher that is even smaller than the student model.
Figure 5: The top-1 accuracy of ResNet18 on FGVC-Aircraft using different intermediate teachers for knowledge distillation. The dashed line represents the baseline result (55.87%) obtained by direct distillation from the VLM.
...and 2 more figures
