A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models

Mario Döbler, Robert A. Marsden, Tobias Raichle, Bin Yang

TL;DR

This study systematically examines prompt-based techniques and existing test-time adaptation methods, aiming to improve the robustness under distribution shift in diverse real-world scenarios, with a particular emphasis on CLIP and its variants.

Abstract

In deep learning, maintaining model robustness against distribution shifts is critical. This work explores a broad range of possibilities to adapt vision-language foundation models at test time, with a particular emphasis on CLIP and its variants. The study systematically examines prompt-based techniques and existing test-time adaptation methods, aiming to improve robustness under distribution shift in diverse real-world scenarios. Specifically, the investigation covers various prompt engineering strategies, including handcrafted prompts, prompt ensembles, and prompt learning techniques. Additionally, we introduce a vision-text-space ensemble that substantially enhances average performance compared to text-space-only ensembles. Since online test-time adaptation has been shown to mitigate performance drops under distribution shift, the study extends its scope to evaluate the effectiveness of existing test-time adaptation methods that were originally designed for vision-only classification models. Through extensive experimental evaluations conducted across multiple datasets and diverse model architectures, the research demonstrates the effectiveness of these adaptation strategies. Code is available at: https://github.com/mariodoebler/test-time-adaptation

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed VTE approach and the application of existing TTA methods for VLMs. Before inference, an average text representation $\bar{\bm{t}}_k$ for each of the $K$ classes is extracted by mapping a list of prompts into the text embedding space. During inference, VTE uses test-time augmentation and entropy-based filtering. When TTA methods are applied, only the parameters of the vision encoder are updated.
  • Figure 2: Average error rate of VTE with a ViT-B-16 backbone across all 17 datasets when using different numbers of augmentations during test-time. The dashed line indicates the performance of zero-shot CLIP with Ensemble prompts.
  • Figure 3: Average error rate for CLIP with RN50 and RN101 backbones, for both source and BN-1. As illustrated, the error rate increases drastically when the normalization statistics are recalculated at test time.
  • Figure 4: UMAP visualization for EuroSAT (top) and Pets (bottom) before (left) and after adaptation (right). To better align the text and image embeddings, we use the projection proposed in ReCLIP (Hu et al., 2024) before applying UMAP. The triangles illustrate the corresponding text ensemble embeddings.
  • Figure 5: Comparison of different models sorted according to their number of parameters from low (left) to high (right). The average error rate across all datasets is reported.
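
The pipeline described in the Figure 1 caption (per-class averaged text embeddings, test-time augmentation, entropy-based filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep_ratio` and `temperature` values, and the choice to average the retained view embeddings before classification are all assumptions made for illustration, and plain lists of floats stand in for real CLIP embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def class_text_ensemble(prompt_embeddings):
    """Average the normalized prompt embeddings of one class, then renormalize."""
    return normalize(mean_vec([normalize(e) for e in prompt_embeddings]))

def vte_predict(view_embeddings, text_means, keep_ratio=0.5, temperature=0.01):
    """Classify one image from the embeddings of its augmented views.

    keep_ratio and temperature are hypothetical defaults, not values
    taken from the paper.
    """
    views = [normalize(v) for v in view_embeddings]

    def class_probs(img):
        logits = [sum(a * b for a, b in zip(img, t)) / temperature
                  for t in text_means]
        return softmax(logits)

    # Entropy-based filtering: keep the most confident (lowest-entropy) views.
    ranked = sorted(views, key=lambda v: entropy(class_probs(v)))
    n_keep = max(1, int(len(ranked) * keep_ratio))

    # Vision-space ensemble (assumed): average the retained view embeddings.
    fused = normalize(mean_vec(ranked[:n_keep]))
    probs = class_probs(fused)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Toy 2-D example: two classes with two prompt embeddings each.
text_means = [
    class_text_ensemble([[1.0, 0.0], [0.9, 0.1]]),  # class 0
    class_text_ensemble([[0.0, 1.0], [0.1, 0.9]]),  # class 1
]
# Four augmented views of a single test image.
views = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.4, 0.6]]
pred, probs = vte_predict(views, text_means)
print(pred, probs)
```

In this toy setup the ambiguous view `[0.5, 0.5]` has the highest entropy and is discarded first, so the fused embedding is dominated by the views that agree with one class. With a real model, the view embeddings would come from the vision encoder and the prompt embeddings from the text encoder.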