WATT: Weight Average Test-Time Adaptation of CLIP

David Osowiechi, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Ali Bahri, Milad Cheraghalikhani, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers

TL;DR

WATT (Weight Average Test-Time Adaptation of CLIP) is an approach for full test-time adaptation (TTA) of this VLM. It adapts CLIP using a diverse set of text prompt templates, averages the resulting model weights, and adds a text ensemble strategy that improves test performance by aggregating diverse textual cues.

Abstract

Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are used as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy that enhances overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without requiring additional model transformations or trainable modules. Moreover, compared to other test-time adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: https://github.com/Mehrdad-Noori/WATT.git
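
The weight-averaging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each text template yields its own adapted copy of the model parameters and that these copies are combined by a plain element-wise mean. The parameter names (`layer.weight`, `layer.bias`), the values, and the helper `average_weights` are hypothetical and are not taken from the WATT repository.

```python
from typing import Dict, List

# A toy stand-in for a model state: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Element-wise mean of parameter dicts that share keys and shapes."""
    if not models:
        raise ValueError("need at least one set of weights")
    averaged: Weights = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


# Hypothetical weights obtained by adapting with three different templates.
adapted = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0, 0.5]},
    {"layer.weight": [2.0, 2.0], "layer.bias": [0.5, 0.5]},
    {"layer.weight": [3.0, 5.0], "layer.bias": [1.0, 0.5]},
]
avg = average_weights(adapted)
print(avg)
```

In a real setting the same idea would apply to tensors rather than Python lists. Which parameters are updated and how often the averaging is repeated are not specified in this summary.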

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 24 tables, 2 algorithms.

Figures (5)

  • Figure 1: Loss and error surfaces on model parameters for the Gaussian noise corruption of the CIFAR-10-C dataset. Points $T^0$, $T^1$, and $T^2$ represent models adapted with different text templates (see Table \ref{tab:templates}). The central point (cross) shows the model obtained by averaging these weights, demonstrating improved performance.
  • Figure 2: Overview of the proposed WATT method. In the Adaptation Phase, the model is adapted using different text templates ($T^0$, $T^1$, ..., $T^H$), with weight averaging performed periodically. In the Evaluation Phase, the adapted CLIP model uses averaged text embeddings from all templates and the weight averaged model to predict the class of the test image.
  • Figure 3: Visual comparison of the Parallel (left) and Sequential (right) approaches for multi-template weight averaging during adaptation.
  • Figure 4: Evolution of the accuracy for different numbers of random templates over 5 test-time runs.
  • Figure 5: Evolution of accuracy on CIFAR-100 corruptions with the Parallel MTWA method.
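
The Evaluation Phase described in the Figure 2 caption (averaged text embeddings from all templates, then class prediction) can be sketched as follows. This is an illustration under stated assumptions: embeddings are L2-normalized before and after averaging, and the class is chosen by cosine similarity. The caption does not confirm either detail. All vectors and the function names `ensemble_text_embeddings` and `predict` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_text_embeddings(per_template: List[List[Vector]]) -> List[Vector]:
    """per_template[h][k] is the text embedding of class k under template h.

    Returns one averaged, re-normalized embedding per class.
    """
    num_templates = len(per_template)
    num_classes = len(per_template[0])
    ensembled = []
    for k in range(num_classes):
        vecs = [l2_normalize(per_template[h][k]) for h in range(num_templates)]
        mean = [sum(col) / num_templates for col in zip(*vecs)]
        ensembled.append(l2_normalize(mean))
    return ensembled


def predict(image_embedding: Vector, class_embeddings: List[Vector]) -> int:
    """Index of the class whose embedding is most similar to the image."""
    img = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(img, c)) for c in class_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings: two templates, two classes (made-up numbers).
text_embeddings = [
    [[1.0, 0.0], [0.0, 1.0]],    # template 0: class 0, class 1
    [[0.8, 0.6], [-0.6, 0.8]],   # template 1: class 0, class 1
]
class_embeddings = ensemble_text_embeddings(text_embeddings)
predicted = predict([0.2, 1.0], class_embeddings)
print(predicted)
```

In the method as captioned, the image embedding would come from the weight-averaged CLIP model. Here it is a fixed toy vector.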