PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

Zheng Li; Xiang Li; Xinyi Fu; Xin Zhang; Weiqiang Wang; Shuo Chen; Jian Yang

TL;DR

PromptKD tackles domain-specific generalization in vision-language models by enabling unsupervised prompt distillation from a large CLIP teacher to a lightweight student. The method operates in two stages. It first pre-trains the teacher on few-shot labeled domain data and stores the teacher's text features as class vectors. It then distills the teacher's knowledge into the student using unlabeled domain images, training learnable prompts and a projector and aligning the two models' logits via the KL-divergence loss $L_{kd}(q^t,q^s,\tau)=\tau^2 \mathrm{KL}(\sigma(q^t/\tau),\sigma(q^s/\tau))$. By reusing the pre-stored class vectors, PromptKD avoids text-branch computation during distillation and inference, achieving competitive performance with lower inference cost. Across 11 datasets, in both base-to-novel and cross-dataset settings, PromptKD delivers strong harmonic mean (HM) gains over baselines, demonstrating practical effectiveness in label-scarce, domain-shifted scenarios.
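To make the distillation loss concrete, the following is a minimal pure-Python sketch of $L_{kd}$ for a single image, assuming $\sigma$ is the softmax and that the KL divergence is taken from the teacher distribution to the student distribution, as the argument order in the formula suggests. The function names `log_softmax` and `kd_loss` are illustrative and are not taken from the authors' code.

```python
import math

def log_softmax(logits, tau):
    """Log of the temperature-scaled softmax, computed stably via log-sum-exp."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]

def kd_loss(teacher_logits, student_logits, tau=1.0):
    """tau^2 * KL(softmax(q_t / tau) || softmax(q_s / tau)) for one sample.

    Assumption: the KL direction is teacher -> student; the paper's exact
    reduction over a batch is not specified in this summary.
    """
    log_p_t = log_softmax(teacher_logits, tau)
    log_p_s = log_softmax(student_logits, tau)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return tau ** 2 * kl

# Toy logits over three classes (made-up values, for illustration only).
loss_same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], tau=2.0)
loss_diff = kd_loss([2.0, 0.0, -1.0], [0.0, 1.0, 0.0], tau=1.0)
```

The loss vanishes when the student reproduces the teacher's distribution and is positive otherwise. Because the softmax is shift-invariant, only differences between logits matter, not their absolute values.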

Abstract

Prompt learning has emerged as a valuable technique in enhancing vision-language models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly focuses on designing various learning forms of prompts, neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework, which aims to transfer the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. Specifically, our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we leverage the unique decoupled-modality characteristics of CLIP by pre-computing and storing the text features as class vectors only once through the teacher text encoder. In the subsequent stage, the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits. Further, we align the logits of both the teacher and student models via KL divergence, encouraging the student image encoder to generate similar probability distributions to the teacher through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data, enabling the algorithm to leverage a vast amount of unlabeled images within the domain. Finally, the well-trained student image encoders and pre-stored text features (class vectors) are utilized for inference. To the best of our knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical pre-storing mechanism of text features as shared class vectors between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
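The pre-storing mechanism can be illustrated with a small pure-Python sketch. It assumes that logits are scaled cosine similarities between an image feature and each stored class vector (the standard CLIP formulation), and that the student's image feature is mapped into the teacher's feature space by a linear projector. All names, dimensions, and numbers below are hypothetical toy values, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(feature, weight):
    """Linear projector: weight has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]

def logits_from_class_vectors(image_feature, class_vectors, scale=1.0):
    """Scaled cosine similarity between one image feature and every class vector.

    `scale` stands in for CLIP's logit scale; its value here is an assumption.
    """
    u = l2_normalize(image_feature)
    return [scale * sum(a * b for a, b in zip(u, l2_normalize(w)))
            for w in class_vectors]

# Class vectors are computed once by the teacher text encoder and then stored
# (hypothetical 3-dimensional teacher space, two classes).
class_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical image features: the teacher already lives in the 3-dim space,
# while the student produces 2-dim features that the projector maps to 3 dims.
teacher_feature = [0.9, 0.1, 0.0]
student_feature = [2.0, 0.5]
projector = [[0.5, 0.0], [0.0, 0.2], [0.0, 0.0]]

# Both branches score against the SAME stored class vectors; no text encoder runs.
teacher_logits = logits_from_class_vectors(teacher_feature, class_vectors)
student_logits = logits_from_class_vectors(
    project(student_feature, projector), class_vectors)

# Inference uses only the student image branch and the stored class vectors.
prediction = max(range(len(student_logits)), key=student_logits.__getitem__)
```

In this sketch the teacher and student logits share a label space by construction, which is what allows them to be aligned with a KL-divergence loss without any labels or text-branch computation.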

Paper Structure (15 sections, 3 equations, 6 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Method
Background
PromptKD: Prompt Distillation for VLMs
Experiments
Settings
Base-to-novel Generalization
Cross-dataset Evaluation
Comparison with Other Methods
Ablation Study
Conclusion
Experimental Settings
Additional Experiments
Discussion

Figures (6)

Figure 1: Harmonic mean (HM) comparison on base-to-novel generalization. All methods adopt the ViT-B/16 image encoder from the pre-trained CLIP model. PromptKD achieves state-of-the-art performance on 11 diverse recognition datasets.
Figure 2: Architecture comparison between the classic KD paradigm for CLIP (e.g., CLIP-KD [yang2023clip]) and our prompt distillation framework. (a) Classic KD methods perform distillation between independent teacher and student models. Students are typically fully fine-tuned by teachers' soft labels. (b) PromptKD breaks the rules of teacher-student independence. We propose to reuse the previously well-trained text features from the teacher pre-training stage and incorporate them into the student image encoder for both distillation and inference.
Figure 3: An overview of our proposed prompt distillation (PromptKD) framework. (a) We first pre-train a large CLIP teacher model using existing state-of-the-art prompt learning methods with labeled training images. Then we save the well-trained text features of all possible classes for the next stages. (b) During the distillation stage, training focuses on the student image prompts and the projection layer; because the pre-saved text features are used as class vectors, the text encoding process incurs no extra computational expense. (c) Finally, the well-trained student and pre-stored class vectors are utilized for inference.
Figure 4: Improved ImageNet classification accuracy of the student model with increasing numbers of unlabeled images per category used for distillation.
Figure 5: Comparison of distillation results for teachers with different capacities. Better teachers lead to better distillation performance.
...and 1 more figure
