Efficient Test-Time Adaptation of Vision-Language Models

Adilbek Karmanov; Dayan Guan; Shijian Lu; Abdulmotaleb El Saddik; Eric Xing

TL;DR

This work tackles distribution shift in vision-language models with TDA, a training-free dynamic adapter that uses two non-parametric caches (positive and negative) to progressively refine test-time predictions without backpropagation. The positive cache stores few-shot test features as keys and their pseudo labels as values to reinforce confident predictions, while the negative cache mitigates pseudo-label noise by applying negative pseudo labeling to uncertain samples. Across two benchmarks (OOD and Cross-Domain) and multiple CLIP backbones, TDA reports state-of-the-art accuracy and cuts test-time cost from hours to minutes compared with prompt-based and other cache-based methods. Ablation and parameter studies cover the two cache designs, the shot capacity, and the negative mask threshold.

Abstract

Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts at test time. Though prior studies have achieved very promising performance, they involve intensive computation, which is at odds with the goal of test-time adaptation. We design TDA, a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging the key-value cache, TDA adapts to test data gradually via progressive pseudo-label refinement, which is highly efficient because it incurs no backpropagation. In addition, we introduce negative pseudo labeling, which alleviates the adverse impact of pseudo-label noise by assigning pseudo labels to certain negative classes when the model is uncertain about its pseudo-label predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency compared with the state of the art. The code has been released at https://kdiaaa.github.io/tda/.
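The dynamic queue described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-class slot layout, the rule that a new sample replaces the highest-entropy entry only when its own entropy is lower, and the names `entropy`, `update_positive_cache`, and `shot_capacity` are all hypothetical choices made for this example.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector (lower means more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def update_positive_cache(cache, feature, probs, shot_capacity=3):
    """Insert a test feature under its pseudo label.

    `cache` maps a class index to a list of (entropy, feature) pairs.
    Assumption: each class keeps at most `shot_capacity` entries, and a full
    class slot only accepts a new sample that is more confident than its
    least confident entry.
    """
    # The pseudo label is the class with the highest predicted probability.
    pseudo_label = max(range(len(probs)), key=lambda c: probs[c])
    h = entropy(probs)
    slots = cache.setdefault(pseudo_label, [])
    if len(slots) < shot_capacity:
        slots.append((h, feature))
    else:
        # Find the stored entry with the highest entropy.
        worst = max(range(len(slots)), key=lambda i: slots[i][0])
        if h < slots[worst][0]:
            slots[worst] = (h, feature)
    return pseudo_label

# Toy stream of three samples, all pseudo-labeled as class 0, with one slot per class.
cache = {}
update_positive_cache(cache, [0.1, 0.9], [0.6, 0.4], shot_capacity=1)
update_positive_cache(cache, [0.2, 0.8], [0.9, 0.1], shot_capacity=1)
update_positive_cache(cache, [0.3, 0.7], [0.55, 0.45], shot_capacity=1)
```

In this toy stream the second sample replaces the first because it is more confident, and the third is rejected because it is less confident than the stored entry. No gradients are computed at any point, which is the property the abstract emphasizes.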

Paper Structure (25 sections, 7 equations, 5 figures, 6 tables)


Introduction
Related Work
Method
Preliminaries
Training-free Dynamic Adapter
Positive Cache.
Negative Cache.
Relationship with TPT and Tip-Adapter
Experiments
Experimental Setup
Benchmarks.
Implementation details.
Comparisons with State-of-the-art
Results on the OOD Benchmark.
Ablation Studies
...and 10 more sections

Figures (5)

Figure 1: Comparison of the proposed Training-free Dynamic Adapter (TDA) with Test-time Prompt Tuning (TPT; Shu et al., 2022) and its enhancement DiffTPT (Feng et al., 2023). Both TPT and DiffTPT require significant computational resources to optimize the learnable prompt via backpropagation, whereas TDA is a training-free dynamic cache that needs no backpropagation, making it efficient for test-time adaptation in various real-world scenarios.
Figure 2: Overview of the proposed Training-free Dynamic Adapter (TDA). TDA constructs and updates two key-value caches that store knowledge from a stream of test samples, and uses them to generate positive and negative predictions, which are combined with the CLIP predictions to produce the final prediction. The CLIP predictions are computed as the dot product between the image features from CLIP's image encoder $E_v$ and the text embeddings from CLIP's text encoder $E_t$, using the hand-crafted prompt and class names. The two caches are updated by gradually incorporating test features and their corresponding pseudo labels, which are calculated from CLIP's predictions, based on prediction entropy and cache capacity.
Figure 3: Ablation studies on two cache designs in TDA: Positive Cache and Negative Cache. All the models are built upon the baseline model CLIP-ResNet-50.
Figure 4: Parameter studies on the Shot Capacity in Positive Cache and Negative Cache.
Figure 5: Parameter studies on the Negative Mask Threshold $p_l$ for negative pseudo-labeling in the Negative Cache. Results are reported as top-1 accuracy on ImageNet, using only the Negative Cache to produce the adapted prediction. The experiments are conducted with CLIP-ResNet-50.
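The combination of cache and CLIP predictions described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exponential affinity (a common choice in cache-based adapters), the weights `alpha` and `beta`, and all function names are assumptions. Only the positive-cache term is shown, because this summary does not spell out how the negative term enters the final prediction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cache_logits(query, keys, values, num_classes, alpha=1.0, beta=5.0):
    """Similarity-weighted vote of cached pseudo labels for one query feature.

    Assumption: affinity = alpha * exp(-beta * (1 - cosine similarity)).
    """
    out = [0.0] * num_classes
    for key, value in zip(keys, values):
        affinity = alpha * math.exp(-beta * (1.0 - dot(query, key)))
        for c in range(num_classes):
            out[c] += affinity * value[c]
    return out

def adapted_logits(clip_logits, query, keys, values):
    """Add the cache vote to the zero-shot CLIP logits (positive term only)."""
    extra = cache_logits(query, keys, values, len(clip_logits))
    return [a + b for a, b in zip(clip_logits, extra)]

# Toy example: two cached features with one-hot pseudo labels for two classes.
keys = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
values = [[1.0, 0.0], [0.0, 1.0]]
query = normalize([1.0, 0.0])
clip_logits = [0.4, 0.5]  # CLIP alone slightly prefers class 1

adapted = adapted_logits(clip_logits, query, keys, values)
prediction = max(range(len(adapted)), key=lambda c: adapted[c])
```

In this toy case the query matches the cached feature for class 0, so the cache vote moves the prediction from class 1 to class 0. The whole computation is a handful of dot products and exponentials, with no parameter updates.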
