Frustratingly Easy Test-Time Adaptation of Vision-Language Models

Matteo Farina; Gianni Franchi; Giovanni Iacca; Massimiliano Mancini; Elisa Ricci

TL;DR

This work theoretically investigates Marginal Entropy Minimization for test-time adaptation and derives ZERO (TTA with "zero" temperature), a method whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions, and marginalize after setting the Softmax temperature to zero.

Abstract

Vision-Language Models seamlessly discriminate among arbitrary semantic categories, yet they still suffer from poor generalization when presented with challenging examples. For this reason, Episodic Test-Time Adaptation (TTA) strategies have recently emerged as powerful techniques to adapt VLMs in the presence of a single unlabeled image. The recent literature on TTA is dominated by the paradigm of prompt tuning by Marginal Entropy Minimization, which, relying on online backpropagation, inevitably slows down inference while increasing memory. In this work, we theoretically investigate the properties of this approach and unveil that a surprisingly strong TTA method lies dormant and hidden within it. We term this approach ZERO (TTA with "zero" temperature), whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions, and marginalize after setting the Softmax temperature to zero. Remarkably, ZERO requires a single batched forward pass through the vision encoder only and no backward passes. We thoroughly evaluate our approach following the experimental protocol established in the literature and show that ZERO largely surpasses or compares favorably w.r.t. the state-of-the-art while being almost 10x faster and 13x more memory-friendly than standard Test-Time Prompt Tuning. Thanks to its simplicity and comparatively negligible computation, ZERO can serve as a strong baseline for future work in this field. The code is available at https://github.com/FarinaMatteo/zero.
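The four steps named in the abstract (augment, predict, filter by confidence, marginalize at zero temperature) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the per-view class logits have already been computed by the model, and the function name `zero_predict`, the `keep_ratio` parameter with its default value, and the use of the maximum Softmax probability as the confidence score are illustrative assumptions. The one fact it relies on is that a Softmax with temperature approaching zero collapses to a one-hot vector on the arg max, so averaging the retained views reduces to counting votes.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable Softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def zero_predict(view_logits, keep_ratio=0.1):
    """Sketch of zero-temperature marginalization over augmented views.

    view_logits: array of shape (n_views, n_classes), one row per augmented view.
    keep_ratio:  fraction of most confident views to retain (hypothetical default).
    Returns the predicted class index and the zero-temperature marginal distribution.
    """
    view_logits = np.asarray(view_logits, dtype=np.float64)
    n_views, n_classes = view_logits.shape

    # Confidence of each view, assumed here to be its maximum Softmax probability.
    confidence = softmax(view_logits).max(axis=1)

    # Retain the most confident views (at least one).
    n_keep = max(1, int(round(keep_ratio * n_views)))
    keep = np.argsort(-confidence, kind="stable")[:n_keep]

    # Temperature -> 0 turns each retained Softmax into a one-hot vector on its
    # arg max, so the marginal distribution is a normalized vote histogram.
    votes = view_logits[keep].argmax(axis=1)
    marginal = np.bincount(votes, minlength=n_classes) / n_keep

    # Ties are broken by the lowest class index here, which is an assumption.
    return int(marginal.argmax()), marginal


if __name__ == "__main__":
    logits = np.array([[5.0, 0.0, 0.0],
                       [0.0, 4.0, 0.0],
                       [0.0, 0.1, 0.0],
                       [3.0, 0.0, 0.0]])
    prediction, marginal = zero_predict(logits, keep_ratio=0.75)
    print(prediction, marginal)
```

Because the sketch only consumes logits, it needs no gradients or backward pass, which is consistent with the abstract's claim that the method runs in a single batched forward pass.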

Paper Structure (30 sections, 1 theorem, 24 equations, 8 figures, 11 tables, 1 algorithm)

Introduction
Understanding Marginal Entropy Minimization
Preliminaries
How does MEM affect the marginal probability distribution?
How does $\overline{p}$ relate to the standard inference protocol?
Simple and surprisingly strong TTA (for free)
Augmentations undermine the reliability of $\overline{p}$
Zero: Test-Time Adaptation with "zero" temperature
Experiments
Experimental Protocol
Results
Related Work
Limitations
Conclusions
Acknowledgements
...and 15 more sections

Key Result

Proposition 2.1

Let $\mathbf{z}_1^{\text{img}}, \ldots, \mathbf{z}_N^{\text{img}}$ be the latent image representations resulting from the $N$ views and $\hat{c} = \mathop{\mathrm{arg\,max}}\limits \overline{p}(\cdot|\mathbf{x}, \boldsymbol{\mathbf{t}_{\text{ctx}}})$ be the initial prediction of the marginal probabi

Figures (8)

Figure 1: Motivating findings. (a) Comparison between the expected error of CLIP-ViT-B-16, denoted as $\epsilon(y)$, and the error of the marginal probability distribution obtained by marginalizing over examples with the same label, $P_{\hat{y}\neq y}(\overline{p})$; (b) Reliability diagrams of CLIP-ViT-B-16 on the ImageNet validation set (left), and its augmented version (right), showing that augmentations largely un-calibrate CLIP exclusively due to overconfidence while leading to slightly better overall accuracy.
Figure 2: Entropy of the pre-TTA marginal probability distribution vs the invariance ratio.
Figure 3: Expected Calibration Error (ECE) [guo2017calibration] of CLIP-ViT-B-16 across 5 datasets for robustness to natural distribution shifts. Blue is the ECE of zero-shot CLIP, and orange is the ECE of zero-shot CLIP on an augmented version of the dataset after confidence-based thresholding.
Figure 4: Reliability diagrams (20 bins) for CLIP-ViT-B-16 on the 4 datasets for Natural Distribution Shifts. In each row, left is the ECE on the source dataset, right on the augmented and filtered version. Row 1: ImageNet-A [hendrycks2021natural]; Row 2: ImageNet-v2 [recht2019imagenet]; Row 3: ImageNet-R [hendrycks2021many]; Row 4: ImageNet-Sketch [wang2019learning].
Figure 5: Reliability diagram (10 bins) for CLIP-ViT-B-16 pretrained on LAION-2B when transferred zero-shot on ImageNet-1k. (left) Source Dataset, (right) Augmented version of the dataset.
...and 3 more figures
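Several of the captions above report the Expected Calibration Error and reliability diagrams with a fixed number of bins. As a reference point, the following is a minimal sketch of the standard equal-width binned ECE estimate: the sample-weighted average, over confidence bins, of the absolute gap between accuracy and mean confidence. The function name, argument names, and right-closed binning convention are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: predicted confidence of each sample, in [0, 1].
    correct:     1 if the corresponding prediction was right, else 0.
    n_bins:      number of equal-width confidence bins.
    """
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)

    # Interior bin edges; np.digitize with right=True assigns each confidence
    # to a right-closed bin (lower, upper], yielding indices 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between accuracy and mean confidence within the bin,
            # weighted by the fraction of samples that fall into it.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


if __name__ == "__main__":
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1]))
```

In a reliability diagram, each bar corresponds to one bin's accuracy in this loop; bars that fall below the diagonal indicate overconfidence.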

Theorems & Definitions (2)

Proposition 2.1
proof
