BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

Xuefeng Hu, Ke Zhang, Min Sun, Albert Chen, Cheng-Hao Kuo, Ram Nevatia

TL;DR

BaFTA improves zero-shot vision-language models at test time without backpropagation. It estimates class embeddings directly through online clustering in a projection space that aligns visual and text embeddings, and it combines predictions from the text-based embeddings and the online centroids, weighting each by a reliability score derived from Rényi Entropy. The method yields consistent accuracy gains on ImageNet and fine-grained datasets and significantly reduces inference time compared to gradient-based test-time prompt tuning. By combining augmented views, multi-source predictions, and projection-based alignment, it makes test-time adaptation stable and efficient for large-scale VLMs like CLIP.

Abstract

Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using Rényi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
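
The reliability-weighted aggregation described above can be illustrated with a minimal sketch. The Rényi entropy of order $\alpha$ (with $\alpha \neq 1$) of a probability vector $p$ is $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$; lower entropy indicates a more confident prediction. The weighting rule below (a softmax over negative entropies) and the names `renyi_entropy` and `aggregate` are assumptions made for illustration, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def renyi_entropy(p, alpha=0.5, eps=1e-12):
    """Rényi entropy of order alpha (alpha != 1) along the last axis of p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return np.log(np.sum(p ** alpha, axis=-1)) / (1.0 - alpha)

def aggregate(preds, alpha=0.5):
    """Combine several probability vectors into one prediction.

    preds has shape (n_sources, n_classes); each row could come from one
    augmented view or from one type of class embedding (text or centroid).
    Assumption: each source is weighted by a softmax over its negative
    Rényi entropy, so confident (low-entropy) sources count more.
    """
    preds = np.asarray(preds, dtype=float)
    h = renyi_entropy(preds, alpha)
    w = np.exp(-h)
    w = w / w.sum()
    return w @ preds, w

# Hypothetical example: one confident source and one uninformative source.
preds = np.array([[0.90, 0.05, 0.05],
                  [1 / 3, 1 / 3, 1 / 3]])
final, weights = aggregate(preds, alpha=0.5)
```

In this example the confident first source receives the larger weight, so the aggregated prediction leans toward it while still summing to one.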

Paper Structure

This paper contains 21 sections, 11 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the Backpropagation-Free Test-Time Adaptation algorithm BaFTA. Instead of prompt-tuning, we employ online clustering to directly estimate class embeddings in a projection space that aligns visual and text embeddings. The class centroids are initialized with text embeddings of class names, and updated incrementally with online test examples assigned to the class. For each test example, we generate two sets of predictions. The first set measures cosine similarity between visual embeddings of augmented views and class name text embeddings. The second set measures cosine similarity between visual embeddings and online-clustering centroids. Predictions are aggregated with reliability estimated by Rényi Entropy for final results.
  • Figure 2: $\alpha$-accuracy curves on 15 datasets, with $\alpha\in[0.1,0.99]$. To fit all curves into one plot with a unified value range, each curve is normalized by subtracting its maximum accuracy. The bold red curve, which represents the accuracy averaged over the 15 datasets, reaches its maximum at $\alpha=0.5$ and $\alpha=0.6$. The plot indicates that prediction aggregation accuracy is not highly sensitive to the choice of $\alpha$: most curves change by less than 0.3% in accuracy across the range $\alpha\in[0.1, 0.99]$.
  • Figure 3: t-SNE plots of original and projected visual embeddings from evaluation datasets.
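
The online clustering step summarized in the Figure 1 caption (centroids initialized with text embeddings, then updated incrementally with the test examples assigned to each class) might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the class name `OnlineCentroids` is hypothetical, the projection that aligns visual and text embeddings is omitted, each text embedding is counted as the first member of its cluster, and a test example is assigned to the centroid with the highest cosine similarity. The paper's actual assignment and update rules may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class OnlineCentroids:
    """Running-mean class centroids, initialized from text embeddings."""

    def __init__(self, text_embeddings):
        # Assumption: the text embedding acts as the first member of each cluster.
        self.sums = l2_normalize(np.asarray(text_embeddings, dtype=float))
        self.counts = np.ones(len(self.sums))

    @property
    def centroids(self):
        # Mean of the assigned embeddings, renormalized to unit length.
        return l2_normalize(self.sums / self.counts[:, None])

    def update(self, visual_embedding):
        """Assign one test embedding to its nearest centroid and update it."""
        v = l2_normalize(np.asarray(visual_embedding, dtype=float))
        sims = self.centroids @ v          # cosine similarity to each class
        k = int(np.argmax(sims))           # assumed hard assignment
        self.sums[k] += v
        self.counts[k] += 1
        return k, sims

# Hypothetical two-class example with 2-D embeddings.
clusters = OnlineCentroids([[1.0, 0.0], [0.0, 1.0]])
assigned, sims = clusters.update([0.9, 0.1])
```

No gradients are computed anywhere in this update: each test example costs one similarity computation and one running-mean update, which is the property that makes the approach backpropagation-free.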