Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models

Raza Imam, Hanan Gani, Muhammad Huzaifa, Karthik Nandakumar

TL;DR

This work addresses zero-shot generalization under distribution shift for vision-language models by introducing Test-Time Low-Rank Adaptation (TTL). TTL inserts low-rank adapters into the visual encoder's self-attention and updates them in a single test-time step using a weighted entropy loss over multiple augmented views, all without access to source data or pre-trained prompts. Empirical results show TTL yields consistent gains over state-of-the-art test-time and zero-shot baselines across natural distribution shifts and cross-dataset transfers, highlighting its parameter efficiency and robustness. The combination of LoRA-based attention adaptation and a confidence-maximizing objective offers a practical pathway to deploy strong VLM generalization in real-world, out-of-domain scenarios.
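The low-rank adapter idea mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of a generic LoRA-style update on a single projection matrix, not the authors' implementation: the class name `LoRALinear`, the rank, the scaling factor `alpha`, the 512-dimensional projection, and the initialization are all assumptions chosen for illustration.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a low-rank update B @ A (generic LoRA-style sketch)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = alpha / rank                           # assumed scaling convention

    def __call__(self, x):
        # x: (n, d_in) -> (n, d_out); the second term is the low-rank correction
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would be updated at test time; W stays frozen
        return self.A.size + self.B.size

# Hypothetical 512-dimensional query projection of one attention layer
rng = np.random.default_rng(1)
W_q = rng.normal(size=(512, 512))
layer = LoRALinear(W_q, rank=4)
x = rng.normal(size=(2, 512))

# With B initialized to zero, the adapted layer reproduces the frozen layer
same_as_frozen = np.allclose(layer(x), x @ W_q.T)
print(same_as_frozen, layer.num_trainable(), W_q.size)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and the adapter holds only a small fraction of the projection's parameters. This is the sense in which such adapters are parameter-efficient.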

Abstract

The conventional modus operandi for adapting pre-trained vision-language models (VLMs) at test time involves tuning learnable prompts, i.e., test-time prompt tuning. This paper introduces Test-Time Low-rank adaptation (TTL) as an alternative to prompt tuning for zero-shot generalization of large-scale VLMs. Taking inspiration from recent advancements in efficiently fine-tuning large language models, TTL offers a test-time parameter-efficient adaptation approach that updates the attention weights of the transformer encoder by maximizing prediction confidence. The self-supervised confidence maximization objective is specified using a weighted entropy loss that enforces consistency among predictions of augmented samples. TTL introduces only a small number of trainable parameters for low-rank adapters in the model space while keeping the prompts and backbone frozen. Extensive experiments on a variety of natural distribution and cross-domain tasks show that TTL can outperform other techniques for test-time optimization of VLMs in strict zero-shot settings. Specifically, TTL outperforms test-time prompt tuning baselines with a significant improvement on average. Our code is available at https://github.com/Razaimam45/TTL-Test-Time-Low-Rank-Adaptation.
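To make the confidence-maximization objective concrete, here is a hedged sketch of an entropy loss weighted across augmented views. The abstract does not specify the weighting scheme, so the one used here (a softmax over negative per-view entropies, with a hypothetical `temperature` parameter) is purely an assumption for illustration and may differ from the paper's actual formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, eps=1e-12):
    # Shannon entropy of each row of class probabilities
    return -(p * np.log(p + eps)).sum(axis=-1)

def weighted_entropy_loss(logits, temperature=1.0):
    """logits: (n_views, n_classes) predictions for augmented views of one image."""
    probs = softmax(logits, axis=-1)
    h = entropy(probs)                            # per-view entropy
    # ASSUMPTION: confident (low-entropy) views receive larger weights
    weights = softmax(-h / temperature, axis=0)
    return float((weights * h).sum()), weights

# Three hypothetical views: confident, mildly confident, uniform
logits = np.array([[8.0, 0.0, 0.0],
                   [1.0, 0.5, 0.0],
                   [0.0, 0.0, 0.0]])
loss, weights = weighted_entropy_loss(logits)
mean_entropy = float(entropy(softmax(logits)).mean())
print(loss, weights, mean_entropy)
```

Under this assumed weighting, the confident view dominates the loss and the uniform view contributes least, so the weighted loss is never larger than the plain average entropy. In a full test-time setup, the gradient of this loss would flow only into the low-rank adapter parameters.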

Paper Structure (18 sections, 7 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: (a) Entropy values corresponding to 8 different octiles result in different performance on Flowers102. (b) TTL implicitly aligns features such that the mean embeddings of test samples better align with those of the source data (LAION) on which CLIP [radford2021learning] is trained.
  • Figure 2: TTL vs. other zero-shot optimization methods. (a) Current methods [shu2022test; feng2023diverse; hassan2024align] update prompts during inference using self-entropy. (b) TTL introduces low-rank learnable weight matrices at the attention layer of the vision encoder to update the model weights using weighted entropy. (c) TTL outperforms existing baselines across out-of-distribution and cross-dataset settings while using less than 0.1% of all model parameters.
  • Figure 3: Working of Test-Time Low-Rank Adaptation (TTL). We integrate parameter-efficient low-rank matrices into the self-attention module of the image encoder. We adapt these low-rank weights on the fly given a single test sample, without the need for pre-trained weights or source data. By maximizing confidence via weighted entropy minimization, TTL updates the low-rank weights so that the VLM adapts to a test sample in a single update step.
  • Figure 4: Test-time performance of zero-shot methods. CLIP vs. Textual Prompt Tuning (TPT) vs. Visual Prompt Tuning vs. Multi-modal Prompt Tuning vs. TTL (Ours) (see Figure \ref{fig:maple_compare}).
  • Figure 5: Test-Time Low-Rank Adaptation across (a, left) different combinations of trainable model components and (b, right) different combinations of query, key, and value of the image encoder.
  • ...and 8 more figures