When Better Teachers Don't Make Better Students: Revisiting Knowledge Distillation for CLIP Models in VQA

Pume Tuchinda, Parinthapat Pengpun, Romrawin Chumpu, Sarana Nutanong, Peerat Limkonchotiwat

TL;DR

This work examines the effectiveness of knowledge distillation (KD) from CLIP-style teachers for vision-language models (VLMs), revealing that stronger teachers do not consistently yield better multimodal students and that scaling KD in this domain faces alignment bottlenecks. Through systematic experiments across teacher/student scales, loss functions, training durations, and data sources, the authors identify representational misalignment and task-specific limitations as key barriers to transferring knowledge from large-capacity teachers to VLMs. They show that substantial parameter efficiency can be achieved (e.g., reducing vision encoder parameters from $85.8\mathrm{M}$ to $5.5\mathrm{M}$ with only modest drops in multimodal performance), but that improvements in the distilled encoder do not reliably translate to VQA or related multimodal tasks. The findings argue for redesigned KD objectives and data-alignment strategies tailored to multimodal settings, guiding future methods toward parameter-efficient, high-performing VLMs.

Abstract

Vision-language models (VLMs) have achieved remarkable success across multimodal tasks, yet their substantial computational demands hinder efficient deployment. Knowledge distillation (KD) has emerged as a powerful approach for building lightweight but competitive models, with strong evidence from both language and vision domains. However, its application to VLMs, particularly CLIP-style models, remains limited, often constrained to small-scale teachers and narrow evaluation tasks such as classification or retrieval. In this work, we present the first systematic study of distillation across a range of CLIP-style teacher models, ranging from standard baselines to large-scale state-of-the-art models. Contrary to trends observed in NLP and vision, we find that stronger teachers do not consistently yield better students; in fact, existing distillation frameworks often fail to scale, leading to degraded performance in downstream multimodal tasks such as visual question answering. Our findings challenge prevailing assumptions in KD and point toward new directions for designing parameter-efficient multimodal models.

Paper Structure

This paper contains 17 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Conceptual overview of the framework. The study investigates why existing CLIP-style knowledge distillation methods, despite success in unimodal settings, fail to scale effectively to stronger teachers in multimodal tasks. The study is structured around two questions: (RQ1) knowledge transfer effectiveness in vision encoders, and (RQ2) the challenges of applying distilled encoders within VLMs.
  • Figure 2: Multimodal performance difference between the various teacher models and the distilled ViT-T as the student model.
  • Figure 3: ImageNet and Multimodal performance when distilling the student model (ViT-T) for longer on the CLIP-KD setup. ImageNet accuracy increases steadily with more training epochs (default: 32), whereas Multimodal performance shows only marginal improvements. CKA similarity rises with ImageNet training but remains relatively constant for Multimodal tasks. For the Multimodal benchmarks, we take the checkpoint from each epoch and train the vision encoder within the TinyLLaVA framework, with Qwen2-0.5B as the LLM.
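
The CKA similarity mentioned in the Figure 3 caption can be illustrated with a minimal sketch of linear centered kernel alignment between two sets of features extracted from the same inputs. This is an assumption-laden illustration, not the authors' code: the summary does not say which CKA variant the paper uses, and the function names below (`center_columns`, `cross_frobenius_sq`, `linear_cka`) are hypothetical. The sketch uses only the Python standard library for portability; in practice one would typically use an array library.

```python
import math


def center_columns(X):
    """Subtract the per-feature (column) mean from a row-major matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_frobenius_sq(A, B):
    """Return ||A^T B||_F^2 for two matrices with the same number of rows."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            s = sum(a[i] * b[j] for a, b in zip(A, B))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n x d1) and Y (n x d2).

    Rows are the same n inputs; the feature widths d1 and d2 may differ,
    e.g. a wide teacher encoder versus a narrow student encoder.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must contain features for the same inputs")
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(
        cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc)
    )
    if denominator == 0.0:
        raise ValueError("CKA is undefined for constant features")
    return numerator / denominator


# Hypothetical toy features: 4 inputs, 2 feature dimensions.
teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
# A rotated and rescaled copy of the teacher features.
student = [[-2.0 * r[1], 2.0 * r[0]] for r in teacher]

print(linear_cka(teacher, teacher))
print(linear_cka(teacher, student))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the rotated and rescaled copy above scores the same as the original. A value near 1 indicates closely aligned representations, and a value near 0 indicates little shared linear structure.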