Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification

Yunyi Xuan; Weijie Chen; Shicai Yang; Di Xie; Luojun Lin; Yueting Zhuang

TL;DR

This work tackles data-free knowledge distillation (DFKD) under distribution shifts by leveraging Vision-Language Foundation Models (VLFMs), exemplified by CLIP. It introduces DFKD-VLFM, a pipeline that synthesizes surrogate images with a VQGAN-CLIP framework guided by diversified text prompts, then distills knowledge from CLIP into a lightweight student without real data. The authors propose three prompt diversification strategies (Mix-Prompt, Random-Prompt, and Contrastive-Prompt) to widen the implicit data distribution captured by the prompts. Contrastive-Prompt delivers the strongest gains on the domain-generalization benchmarks PACS, VLCS, ImageCLEF-DA, and VisDA in zero-shot and few-shot settings. The results indicate that large pre-trained VLFMs can provide robust, transferable supervision for compact models without data access, enabling edge deployment and broad downstream applicability.

Abstract

Data-Free Knowledge Distillation (DFKD) has shown great potential in creating a compact student model while alleviating the dependency on real training data by synthesizing surrogate data. However, prior work has seldom been studied under distribution shifts, which may leave it vulnerable in real-world applications. Recent Vision-Language Foundation Models, e.g., CLIP, have demonstrated remarkable performance in zero-shot out-of-distribution generalization, yet they consume heavy computational resources. In this paper, we discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets. The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts, inheriting the out-of-distribution generalization capability of the pre-trained foundation models. To avoid degrading generalization, the primary challenge of this task is to synthesize diverse surrogate images driven by text prompts. Since text prompts encode not only category concepts but also style information, we propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles, namely Mix-Prompt, Random-Prompt, and Contrastive-Prompt. Experiments on out-of-distribution generalization datasets demonstrate the effectiveness of the proposed methods, with Contrastive-Prompt performing the best.

Paper Structure (35 sections, 7 equations, 11 figures, 8 tables)

Introduction
Related Works
Vision-Language Foundation Models
Data-Free Knowledge Distillation
Method
Preliminary
Prompt Diversification
Mix-Prompt
Random-Prompt
Contrastive-Prompt
Experiments
Experiment Settings
Datasets
Implementation details
Zero-shot Classification
...and 20 more sections

Figures (11)

Figure 1: Synthesize, distill, and then generalize. Taking CLIP as an example, the foundation model includes an image encoder $\mathcal{E}_{img}$ and a text encoder $\mathcal{E}_{txt}$. Surrogate images are first synthesized from the foundation model, given the names of the target categories [CLS1, ..., CLSn]. Knowledge distillation is then performed on the synthesized dataset to customize a generalizable student, with CLIP acting as the teacher.
Figure 2: A comparison of surrogate image synthesis. Top: "person"; middle: "elephant"; bottom: "house". Directly using VQGAN-CLIP suffers from mode collapse. Contrastive-Prompt significantly increases data diversity with complex contexts, facilitating Data-Free Knowledge Distillation from Vision-Language Foundation Models.
Figure 3: Conventional DFKD aims to diversify images directly. In contrast, in the context of vision-language foundation models, we aim to diversify text prompts as a bridge to synthesize diverse surrogate images since the style information can be encoded in the text prompts implicitly. Here $m$ denotes the category number while $n$ is the sample number.
Figure 4: Three Prompt Diversification methods, Mix-Prompt, Random-Prompt, and Contrastive-Prompt, are utilized to generate diverse text prompts $T_m$, resulting in diverse task-specific high-fidelity images. With these surrogate training images, we can customize a task-specific student model ($\theta_S$) by extracting knowledge from CLIP ($\mathcal{E}_{img}$ + $\mathcal{E}_{txt}$).
Figure 5: Results of few-shot fine-tuning on three datasets. "Num." denotes the number of shots used for training. Contrastive-Prompt turns the crafted student into a strong few-shot learner (blue lines), surpassing the model pre-trained on ImageNet (orange lines).
...and 6 more figures
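
The distillation step described in the captions of Figures 1 and 4 (CLIP as teacher, a compact student trained on synthesized images) can be illustrated with a generic knowledge-distillation objective. The sketch below is an assumption, not the paper's exact loss: it uses the standard temperature-softened KL divergence between teacher and student class distributions. The function names, the temperature parameter, and the example logits are all hypothetical. In a CLIP setting, the teacher logits would typically be the similarities between an image embedding and the text embeddings of the category names.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Generic distillation objective (assumed here for illustration);
    the paper's actual loss may differ.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical logits for one synthesized image over three categories.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
bad_student = [0.5, 1.0, 4.0]    # prefers a different category

print("loss (close student):", kd_loss(teacher, good_student))
print("loss (far student):  ", kd_loss(teacher, bad_student))
```

A student whose predictions track the teacher's receives a lower loss than one that prefers a different category, which is the signal the student is trained on in place of ground-truth labels.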
