AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

Jan Hendrik Metzen; Piyapat Saranrittichai; Chaithanya Kumar Mummadi

TL;DR

AutoCLIP addresses the suboptimality of weighting all prompt templates equally in zero-shot vision-language classification. It learns per-image template weights $w_i$ via a gradient step on a $\operatorname{logsumexp}$ objective, with the weights parameterized as $w = \mathrm{softmax}(\rho)$. This test-time adaptation operates entirely in the embedding space, so no gradients are propagated through the encoders and inference stays lightweight with a fixed set of templates. An entropy-based scheme sets the step size implicitly through a target entropy $\beta \log_2 K$, where $K$ is the number of prompt templates, which reduces hyperparameter tuning in zero-shot settings. Across diverse datasets, models, and template families, AutoCLIP yields consistent accuracy gains (around 0.45pp on average, up to 3pp in some cases) with modest overhead, which supports its use as a default zero-shot inference strategy for vision-language models.
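
The update summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `autoclip_weights`, the input layout `sim` (a K x C matrix of descriptor-image cosine similarities), the zero initialization of $\rho$, class scores taken as weighted averages of those similarities, the bisection search for the step size, and the example value `beta=0.85` are all choices made for this sketch.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    p = np.clip(p, 1e-12, 1.0)  # avoid log2(0)
    return float(-(p * np.log2(p)).sum())

def autoclip_weights(sim, beta=0.85, iters=50):
    """Per-image template weights from a (K, C) similarity matrix.

    sim[i, j] is the cosine similarity between the image and the
    descriptor built from template i and class j (assumed layout).
    """
    K, _ = sim.shape
    w = np.full(K, 1.0 / K)     # rho = 0 gives uniform weights
    s = w @ sim                 # (C,) class scores under uniform weights
    p = softmax(s)              # d logsumexp(s) / d s
    a = sim @ p                 # (K,) d logsumexp / d w
    g = w * (a - w @ a)         # chain rule through w = softmax(rho)

    if np.allclose(g, g[0]):    # no preference between templates
        return w

    # Pick the step size so that softmax(alpha * g) has entropy
    # beta * log2(K); entropy decreases monotonically in alpha >= 0.
    target = beta * np.log2(K)
    lo, hi = 0.0, 1.0
    while entropy_bits(softmax(hi * g)) > target and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):      # bisection on alpha
        mid = 0.5 * (lo + hi)
        if entropy_bits(softmax(mid * g)) > target:
            lo = mid
        else:
            hi = mid
    return softmax(hi * g)
```

Calling `autoclip_weights(sim)` on the similarities of one image returns a length-K weight vector that sums to one; the weighted class scores are then `weights @ sim`. Because only precomputed similarities enter the computation, neither encoder is re-run or differentiated.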

Abstract

Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates, through templates obtained from a large language model, to templates built from random words and characters. Up until now, deriving zero-shot classifiers from the respective encoded class descriptors has remained nearly unchanged: an image is assigned to the class that maximizes the cosine similarity between its averaged encoded class descriptors and the image encoding. However, weighting all class descriptors equally can be suboptimal when certain descriptors match the visual cues in a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP tunes per-image weights for each prompt template at inference time, based on statistics of class descriptor-image similarities. AutoCLIP is fully unsupervised, has only a minor additional computational overhead, and can be easily implemented in a few lines of code. We show that AutoCLIP consistently outperforms baselines across a broad range of vision-language models, datasets, and prompt templates, by up to 3 percentage points in accuracy.
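
The baseline classifier described in the abstract, and the point at which per-template weights would enter, can be illustrated as follows. This is a hedged sketch: the function names, the `(K, C, D)` layout of `descriptor_embs`, and the optional `weights` argument are assumptions for illustration, and the embeddings would in practice come from a vision-language model's encoders.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def zero_shot_predict(image_emb, descriptor_embs, weights=None):
    """Classify one image from encoded class descriptors.

    image_emb:       (D,) encoded image
    descriptor_embs: (K, C, D) encoded descriptors, one per
                     (template, class) pair (assumed layout)
    weights:         optional (K,) template weights; None means the
                     uniform average used by the standard classifier
    """
    K = descriptor_embs.shape[0]
    if weights is None:
        weights = np.full(K, 1.0 / K)
    # Weighted average over templates gives one query per class: (C, D)
    queries = np.tensordot(weights, descriptor_embs, axes=1)
    # Cosine similarity between each class query and the image
    sims = l2_normalize(queries) @ l2_normalize(image_emb)
    return int(np.argmax(sims)), sims
```

With `weights=None` this reduces to the equal-weight classifier; passing a per-image weight vector changes only the averaging step, leaving the encoders untouched.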

Paper Structure (20 sections, 12 figures, 4 tables, 2 algorithms)

Introduction
Related Work
AutoCLIP
Background: Zero-Shot Classifiers for Vision-Language Models
Auto-Tuning Zero-Shot Classifiers
Closed-form Computation of Gradient
Auto-Tuning the Step Size
Experiments
Experimental Setting
Results
Ablations
Analysis in a Controlled Setting
Conclusion
Appendix
Inference time overhead of AutoCLIP
...and 5 more sections

Figures (12)

Figure 1: Conceptual Illustration of AutoCLIP. CLIP's zero-shot classifiers are based on a set of prompt templates $t_i$ ("A photo of a <class_name>", "A drawing of a <class_name>", ...). Inserting class names $c$ into these templates gives a set of class descriptors that are encoded into a joint embedding space together with the respective image. Standard CLIP averages encoded class descriptors $q_i(c)$ into class queries $q_c$, and classifies to the class that has maximal cosine similarity with the encoded image. However, this ignores that some prompt templates describe the image of interest better than others (their embeddings have higher average similarity): for instance, when the image is a drawing, the template "A drawing of a <class_name>" results in stronger class descriptors than other templates and should thus be weighted higher when computing class queries. AutoCLIP determines such weights directly from class descriptor-image similarities in the embedding space. Here, the car image is taken from atkinson2015car.
Figure 2: Accuracy improvement ($\Delta$ Accuracy) of AutoCLIP over the baseline zero-shot classifier across models, datasets, and prompt ensembles. Shown are mean and standard error over 7 runs.
Figure 3: ImageNet-C accuracy improvement ($\Delta$ Accuracy) of AutoCLIP over the baseline zero-shot classifier for $K=100$ across models, corruption severities, and prompt ensembles, averaged over corruptions and 7 runs.
Figure 4: Ablation on target entropy rate $\beta$. Shown is the accuracy improvement ($\Delta$ Accuracy) of AutoCLIP over the baseline zero-shot classifier for a CLIP ViT-B-16 with 100 WaffleCLIP prompt templates, averaged over 7 runs.
Figure 5: Comparison of different objective functions for auto-tuning. Shown is the accuracy improvement ($\Delta$ Accuracy) of AutoCLIP over the baseline zero-shot classifier for a ViT-B-16 with 100 WaffleCLIP prompt templates, averaged over 7 runs.
...and 7 more figures
