Noise is an Efficient Learner for Zero-Shot Vision-Language Models

Raza Imam, Asif Hanif, Jian Zhang, Khaled Waleed Dawoud, Yova Kementchedjhieva, Mohammad Yaqub

TL;DR

This work tackles distribution shift in zero-shot vision-language models by introducing Test-Time Noise Tuning (TNT), which optimizes learnable noise in the visual input space for a single test sample. TNT jointly minimizes an entropy-based loss and an inter-view consistency loss across augmented views, enabling adaptive feature learning without updating the model weights. By selecting the top-$K$ confident views and applying temperature scaling during inference, TNT improves out-of-distribution generalization and calibration on natural-shift and cross-dataset benchmarks relative to zero-shot CLIP. The approach offers a lightweight pathway to robust VLMs that leaves the model weights frozen, and it opens avenues for extending noise-adaptive strategies to retrieval and medical-imaging tasks; memory efficiency is highlighted as future work.
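The confident-view selection, entropy objective, and temperature scaling mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of per-view entropy as the confidence score, and the choice to take the entropy of the averaged prediction are all assumptions made for the example, and the toy logits are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; logits are divided by `temperature` first."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_confident_views(view_logits, k):
    """Indices of the k views with the lowest predictive entropy (assumed confidence score)."""
    order = sorted(range(len(view_logits)),
                   key=lambda i: entropy(softmax(view_logits[i])))
    return order[:k]

def marginal_entropy(view_logits, indices, temperature=1.0):
    """Entropy of the prediction averaged over the selected views (assumed loss form)."""
    probs = [softmax(view_logits[i], temperature) for i in indices]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return entropy(avg), avg

# Toy logits for four augmented views over three classes (invented values).
view_logits = [
    [4.0, 0.5, 0.1],  # confident
    [1.0, 0.9, 1.1],  # nearly uniform
    [3.0, 0.2, 0.3],  # confident
    [0.5, 0.6, 0.4],  # nearly uniform
]
top = select_confident_views(view_logits, k=2)
loss, avg_probs = marginal_entropy(view_logits, top)
```

In a full pipeline, a loss of this kind would be differentiated with respect to the input noise by an autograd framework; that step is omitted here.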

Abstract

Recently, test-time adaptation has garnered attention as a method for tuning models without labeled data. The conventional modus operandi for adapting pre-trained vision-language models (VLMs) at test time primarily focuses on tuning learnable prompts; however, this approach overlooks potential distribution shifts in the visual representations themselves. In this work, we address this limitation by introducing Test-Time Noise Tuning (TNT), a novel method for handling unpredictable shifts in the visual space. TNT leverages, for the first time, a noise adaptation strategy that optimizes learnable noise directly in the visual input space, enabling adaptive feature learning from a single test sample. We further introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding distances, ensuring consistent feature representations across views. Combined with scaled logits and confident view selection at inference, TNT substantially enhances VLM generalization and calibration, achieving average gains of +7.38% on natural distribution shift benchmarks and +0.80% on cross-dataset evaluations over zero-shot CLIP. These improvements lay a strong foundation for adaptive out-of-distribution handling.
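One way to read "enforcing coherence in embedding distances" is as a penalty on how far apart the embeddings of the selected views lie. The sketch below illustrates that idea only; the exact loss is not given in this section, so the mean pairwise Euclidean distance over L2-normalized embeddings, and the name `view_consistency_loss`, are assumptions made for the example.

```python
import math
from itertools import combinations

def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def view_consistency_loss(embeddings):
    """Mean pairwise Euclidean distance between normalized view embeddings.

    Assumed form for illustration: the loss is 0 when all views map to the
    same direction and grows as their embeddings spread apart.
    """
    normed = [l2_normalize(e) for e in embeddings]
    pairs = list(combinations(normed, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Views that agree up to scale incur (numerically) zero loss.
aligned = view_consistency_loss([[1.0, 2.0], [2.0, 4.0], [0.5, 1.0]])
# Orthogonal views are sqrt(2) apart after normalization.
spread = view_consistency_loss([[1.0, 0.0], [0.0, 1.0]])
```

Minimizing such a term alongside the entropy loss would pull the view embeddings toward one another, which matches the stated goal of consistent feature representations across views.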

Paper Structure

This paper contains 13 sections, 8 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: As the top-$K$ augmented view embeddings grow more consistent with each optimization step $t$, the attention mechanism focuses on relevant regions, leading to improved accuracy. Attention Difference shows the absolute difference between the clean attention map and the noise-tuned attention map. Zero-shot CLIP incorrectly classifies the original image as agama, while TNT correctly classifies the optimized image as garter snake.
  • Figure 2: Test-Time Noise Tuning (TNT) (1) generates augmented views of a test image, (2) applies adaptive learnable noise, and (3) computes logits and feature vectors for each view. (4) Top-$K$ views are selected by confidence, with (5) entropy loss [Eq. \ref{eq:entropy}] enforcing confident predictions and (6) inter-view consistency loss [Eq. \ref{eq:Lvc_loss}] aligning feature representations. (7) The combined loss is backpropagated to iteratively refine the noise, enabling adaptive test-time noise tuning.
  • Figure 3: Analysis of Trainable Parameters (TP) for TNT, textual tuning, and encoder tuning. Circle size indicates the #TP. Textual Tuning methods use the same TP count of 2K to optimize prompts. RLCF* refers to RLCF with all visual encoder parameters trainable, Layer Norm limits trainable parameters of visual encoder to only Layer Norms, and Visual Prompt applies learnable prompts to the visual encoder across 12 layers of the ViT encoder. TNT$\ddag$ indicates TNT with only $224\times9$ trainable noise parameters, compared to standard TNT with $224\times224\times3$ TP. Noise denotes optimization with $224\times9$ TP in noise and with only $\mathcal{L}_{\text{entropy}}$ loss.
  • Figure 4: Effect of TNT components on (a) Top-1 Accuracy (higher$\uparrow$ is better) and (b) ECE (lower$\downarrow$ is better). E: Noise optimization with entropy minimization $\mathcal{L}_{\text{entropy}}$. E+V: Adds the $\mathcal{L}_{\text{vc}}$ loss (Eq. \ref{eq:Lvc_loss}) to $\mathcal{L}_{\text{entropy}}$. E+V+T$^\prime$: Adds temperature scaling during inference to E+V. E+V+T: Uses the top-$K$ views instead of one test image (Eq. \ref{eq:inference}), i.e. TNT*. TNT: TNT* with CoOp initialization. TNT+PT (Prompt Tuning): Optimizes textual prompts with TNT. TNT+ET (Encoder Tuning): Optimizes the visual encoder with TNT. A single optimization step ($t=1$) is used throughout. The same legend is used for (a) and (b).
  • Figure 5: Increasing either the number of optimization steps or the number of augmentations yields (a) higher Top-1 Accuracy and (b) lower Expected Calibration Error (ECE). TNT* and TNT denote hand-crafted and CoOp-based prompts, respectively. The legend is shared throughout.
  • ...and 1 more figure
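
Figures 4 and 5 report calibration as Expected Calibration Error (ECE). For readers unfamiliar with the metric, the sketch below computes the standard binned estimate: predictions are grouped by confidence, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The bin count of 10, the half-open bin edges, and the function name are assumptions for illustration; the paper's exact evaluation settings are not stated in this section.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Overconfident toy model: 90% confidence but only half the predictions are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
# Fully confident and always right: no calibration gap.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

Temperature scaling at inference changes only the confidences, not the predicted class, so it can lower ECE while leaving Top-1 accuracy unchanged.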