Realistic Test-Time Adaptation of Vision-Language Models

Maxime Zanella; Clément Fuchs; Christophe De Vleeschouwer; Ismail Ben Ayed

TL;DR

This work addresses robust test-time adaptation of Vision-Language Models (VLMs) under realistic deployment conditions, where the number of effective classes per batch and the correlation between batches can vary widely. It introduces Stat${\cal A}$, a transductive method that regularizes Gaussian class statistics with a Statistical Anchor derived from text prompts and zero-shot predictions, implemented via block-coordinate descent with a KL-based anchor term. The paper defines two realistic evaluation settings (a varying number of effective classes $K_{\text{eff}}$ within batches, and online non-i.i.d. data streams) and reports extensive ablations, showing that Stat${\cal A}$ consistently improves robustness across diverse scenarios while remaining efficient (thousands of samples processed quickly). The results indicate that traditional TTA methods can degrade zero-shot robustness in realistic conditions, whereas Stat${\cal A}$ offers a practical, scalable approach to real-world VLM adaptation with minimal computational overhead.
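To make the "KL-based anchor term" more concrete, the sketch below computes the closed-form KL divergence between two Gaussians with diagonal covariances. It is an illustration only: the function name `kl_diag_gaussians`, the diagonal-covariance restriction, and the direction of the divergence (adapted statistics first, anchor second) are assumptions, not details confirmed by the text above.

```python
import math


def kl_diag_gaussians(mu, var, mu_anchor, var_anchor):
    """KL( N(mu, diag(var)) || N(mu_anchor, diag(var_anchor)) ).

    Hypothetical helper: the argument order (adapted statistics first,
    anchor statistics second) is an assumption made for illustration.
    """
    if not (len(mu) == len(var) == len(mu_anchor) == len(var_anchor)):
        raise ValueError("all inputs must have the same dimension")
    total = 0.0
    for m, v, m_a, v_a in zip(mu, var, mu_anchor, var_anchor):
        # Per-dimension terms: trace, squared mean shift, and log-determinant.
        total += v / v_a + (m_a - m) ** 2 / v_a - 1.0 + math.log(v_a / v)
    return 0.5 * total
```

The divergence is zero when the adapted statistics coincide with the anchor and grows as they drift away, which is the behavior one would want from a term meant to keep class statistics close to the text-derived ones.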

Abstract

The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models' initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code available at https://github.com/MaxZanella/StatA.
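Setting (i) can be illustrated with a small sampler that builds a batch containing exactly $K_{\text{eff}}$ distinct classes. This is a minimal sketch under stated assumptions: the function name `sample_batch`, the uniform choice of classes, and sampling without replacement are illustrative choices and may differ from the protocol used in the paper.

```python
import random
from collections import defaultdict


def sample_batch(labels, k_eff, batch_size, seed=0):
    """Return indices of a batch that contains exactly k_eff distinct classes.

    Hypothetical helper for illustration; `labels` is a list of integer
    class labels, one per test sample.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    if k_eff > len(by_class) or k_eff > batch_size:
        raise ValueError("k_eff must not exceed the class count or batch size")

    # Pick the effective classes uniformly at random (an assumption).
    chosen = rng.sample(sorted(by_class), k_eff)

    # Take one sample per chosen class so every class is really present.
    batch = [rng.choice(by_class[c]) for c in chosen]
    taken = set(batch)

    # Fill the rest of the batch from the remaining samples of those classes.
    pool = [i for c in chosen for i in by_class[c] if i not in taken]
    if len(pool) < batch_size - k_eff:
        raise ValueError("not enough samples in the chosen classes")
    batch += rng.sample(pool, batch_size - k_eff)
    rng.shuffle(batch)
    return batch
```

Sweeping `k_eff` from a handful of classes up to the full label set gives a simple way to probe how an adaptation method behaves when most classes are absent from a batch.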

Paper Structure (43 sections, 37 equations, 4 figures, 10 tables, 1 algorithm)

Introduction
Our contributions.
Related work
Transductive learning in VLMs.
Test-time adaptation in VLMs.
Realistic test-time adaptation
Batch realistic scenarios.
Online realistic scenarios.
Method
Formulation
Regularized Maximum Likelihood Estimation.
Proposed Statistical Anchor (StatA) term.
Regularized updates of the parameters
Interpretation.
Implementation.
...and 28 more sections

Figures (4)

Figure 1: We advocate for evaluating transductive or online TTA methods on more extensive realistic scenarios.
Figure 2: Illustration of two realistic scenarios: (a) batch adaptation with a limited number of effective classes and (b) online test-time adaptation with a correlated, non-i.i.d. data stream.
Figure 3: Ablation study on the impact of the anchor weighting $\alpha$ across various numbers of effective classes ($K_{\text{eff}}$). The line corresponding to $\alpha=1$ (used in all our experiments) is highlighted with a wider stroke. Each reported performance is averaged over 1,000 tasks.
Figure 4: Correlation matrix of per-batch $\ell_2$-normalized vectors of class proportions for batch size $128$. The $x$ and $y$ axes of each plot give the batch index, in the order in which the batches are processed. The inter-batch correlation increases as the Dirichlet parameter $\gamma$ decreases.
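The quantity plotted in Figure 4 can be sketched as follows: compute the class-proportion vector of each batch, $\ell_2$-normalize it, and take pairwise inner products. The helper names are hypothetical, and reading "correlation" as the inner product of normalized vectors (cosine similarity) is an assumption based on the caption alone.

```python
import math
from collections import Counter


def class_proportions(batch_labels, num_classes):
    """Fraction of the batch belonging to each class 0..num_classes-1."""
    if not batch_labels:
        raise ValueError("batch must not be empty")
    counts = Counter(batch_labels)
    return [counts.get(k, 0) / len(batch_labels) for k in range(num_classes)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def batch_similarity_matrix(batches, num_classes):
    """Pairwise inner products of normalized class-proportion vectors.

    Entry [i][j] compares batch i with batch j, in processing order.
    """
    vecs = [l2_normalize(class_proportions(b, num_classes)) for b in batches]
    return [[sum(a * b for a, b in zip(u, v)) for v in vecs] for u in vecs]
```

For example, two batches drawn from the same pair of classes give an entry close to 1, while batches with disjoint classes give 0, so a strongly correlated stream shows up as bright blocks near the diagonal.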
