Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

Marco Mistretta, Alberto Baldrati, Marco Bertini, Andrew D. Bagdanov

TL;DR

KDPL addresses the zero-shot generalization gap in vision–language prompt learning by distilling knowledge from a large teacher model into a lightweight student, without using labeled data. It integrates with existing prompt-learning methods and supports both label-agnostic and class-agnostic adaptation by aligning the teacher probabilities $p_T$ and the student probabilities $p_S$ through a symmetric KL loss. The approach is validated on domain generalization, cross-dataset transfer, and base-to-novel class generalization across more than ten datasets, showing consistent gains over baselines and competitive results in class-agnostic settings. KDPL reduces the reliance on labels for adaptation and transfers across backbones and prompting strategies, with a practical compute overhead that may further decrease with tuning or larger teachers.

Abstract

Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge. The code is publicly available at https://github.com/miccunifi/KDPL.

Paper Structure (13 sections, 3 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Motivation and overview. (Top left) Lightweight VLMs like CLIP achieve impressive zero-shot performance but lag behind supervised approaches; large VLMs incur a high computational burden. (Bottom left) Parameter-efficient prompt learning offers a non-destructive approach to adapting VLMs to downstream tasks; however, existing methods require annotated samples and struggle to generalize to unseen classes. (Right) Our approach does not require labeled samples and learns by distilling knowledge from a more powerful VLM. It can be seamlessly integrated into existing prompt learning techniques and generalizes better to unseen classes on downstream tasks.
  • Figure 2: Knowledge Distillation Prompt Learning (KDPL) overview. Given a lightweight VLM student and a larger, more powerful VLM teacher, KDPL updates the student prompt parameters by distilling knowledge from the teacher. KDPL first performs zero-shot classification with the teacher to obtain teacher probabilities $p_T$. It then computes the student probabilities $p_S$ and performs knowledge distillation to update the student prompt parameters $\gamma$.
  • Figure 3: Class-agnostic adaptation. (a) Comparison between the zero-shot baselines and our class-agnostic CA-KDPL variants (highlighted in cyan). The prompt is learned in an unsupervised, class-agnostic setting on ImageNet and evaluated on the benchmark datasets for the domain generalization ($\text{AVG}^1$) and cross-dataset ($\text{AVG}^2$) evaluations. Average performance improvements are indicated in green, and deterioration in red. (b) Per-dataset accuracy comparison between our unsupervised, class-agnostic method (CoOp+CA-KDPL), the supervised baseline CoOp, and the zero-shot student on the cross-dataset benchmark datasets.
  • Figure 4: Generalization to unseen classes. Base-to-unseen generalization results of MaPLe+KDPL versus MaPLe. Prompts are learned on the base classes and evaluated both on those base classes and on the unseen classes of the test set. Instead of reporting only the 16-shot performance, we include results for 1, 2, 4, 8, and 16 shots. Increasing the number of shots improves performance on the base classes but harms performance on unseen classes. Our approach alleviates this problem: the average 16-shot performance on unseen classes exceeds the zero-shot reference, whereas the average MaPLe performance remains below the zero-shot reference for every number of shots.
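
The distillation step summarized in the Figure 2 caption (teacher zero-shot probabilities $p_T$, student probabilities $p_S$, and a symmetric KL loss between them) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the random features stand in for real encoder outputs, and the feature dimensions, the temperature value, and the choice to sum the two KL directions and average over the batch are all assumptions made for the example. In the actual method only the student prompt parameters $\gamma$ would receive gradients from this loss; the sketch computes the loss value only.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def zero_shot_probs(image_feats, text_feats, temperature=0.01):
    """Class probabilities from cosine similarity between image and
    class-text embeddings (temperature is an assumed value)."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return softmax(img @ txt.T, temperature)


def symmetric_kl(p_t, p_s, eps=1e-12):
    """KL(p_t || p_s) + KL(p_s || p_t), averaged over the batch.
    Summing the two directions is an assumption of this sketch."""
    p_t = np.clip(p_t, eps, 1.0)
    p_s = np.clip(p_s, eps, 1.0)
    kl_ts = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    kl_st = np.sum(p_s * (np.log(p_s) - np.log(p_t)), axis=-1)
    return float(np.mean(kl_ts + kl_st))


# Hypothetical stand-ins for encoder outputs: 4 unlabeled images, 5 classes.
# Teacher and student may use different embedding sizes; only the class
# dimension of the resulting probabilities has to match.
rng = np.random.default_rng(0)
teacher_img, teacher_txt = rng.normal(size=(4, 16)), rng.normal(size=(5, 16))
student_img, student_txt = rng.normal(size=(4, 8)), rng.normal(size=(5, 8))

p_t = zero_shot_probs(teacher_img, teacher_txt)  # teacher probabilities p_T
p_s = zero_shot_probs(student_img, student_txt)  # student probabilities p_S
loss = symmetric_kl(p_t, p_s)                    # no labels are involved
print(f"symmetric KL distillation loss: {loss:.4f}")
```

Because the loss depends only on the two probability vectors, no ground-truth label appears anywhere in the computation, which is what makes the adaptation unsupervised.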