Noisy Test-Time Adaptation in Vision-Language Models

Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han

TL;DR

This work tackles the challenge of noisy test-time data for vision-language models by introducing Zero-Shot Noisy TTA (ZS-NTTA) and revealing that existing TTA methods can be overwhelmed by unfiltered noise. To address this, it proposes AdaND, a detector-classifier decoupling framework that trains an Adaptive Noise Detector on frozen VLM features using pseudo-labels from a zero-shot baseline, while Gaussian noise injections mitigate misclassification of clean data. Empirically, AdaND delivers state-of-the-art results on ZS-NTTA and competitive improvements on ZS-OOD across ImageNet and diverse datasets, with substantial gains in $\text{Acc}_\text{H}$ (up to about 8.32 percentage points) and $\text{FPR}_{95}$ (up to about 9.40 percentage points), while maintaining runtime comparable to model-frozen methods. The approach is zero-shot, noise-agnostic, and plug-and-play with existing TTA methods, and the authors provide benchmarks and public code to facilitate broader adoption in open-world vision-language applications.

Abstract

Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing. In open-world scenarios, models often encounter noisy samples, i.e., samples outside the in-distribution (ID) label space. Leveraging the zero-shot capability of pre-trained vision-language models (VLMs), this paper introduces Zero-Shot Noisy TTA (ZS-NTTA), which focuses on adapting the model at test time, in a zero-shot manner, to target data that contains noisy samples. We find that existing TTA methods underperform under ZS-NTTA, often lagging behind even the frozen model. We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Moreover, adapting a single classifier for both ID classification and noise detection hampers both sub-tasks. Building on these findings, we propose a framework that decouples the classifier and the detector, focusing on developing a separate detector while keeping the classifier frozen. Technically, we introduce the Adaptive Noise Detector (AdaND), which uses the frozen model's outputs as pseudo-labels to train a noise detector. To handle clean data streams, we further inject Gaussian noise during adaptation, preventing the detector from misclassifying clean samples as noisy. Beyond ZS-NTTA, AdaND can also improve the zero-shot out-of-distribution (ZS-OOD) detection ability of VLMs. Experiments show that AdaND outperforms existing methods in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of $8.32\%$ in harmonic mean accuracy ($\text{Acc}_\text{H}$) for ZS-NTTA and $9.40\%$ in FPR95 for ZS-OOD detection, compared to SOTA methods. Importantly, AdaND is computationally efficient, with a cost comparable to that of the model-frozen method. The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA.
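
The abstract reports results in $\text{Acc}_\text{H}$ and FPR95 without defining them in this excerpt. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes that $\text{Acc}_\text{H}$ is the harmonic mean of the accuracy on clean samples and the accuracy on noisy samples (the excerpt only says "harmonic mean accuracy", so the two components are an assumption), and that FPR95 follows the usual OOD-detection convention: the fraction of OOD samples accepted as ID at the threshold that keeps 95% of ID samples. The function names and the "higher score means more ID-like" convention are hypothetical choices for illustration.

```python
def harmonic_mean_accuracy(acc_clean: float, acc_noisy: float) -> float:
    """Harmonic mean of two accuracies given as fractions in [0, 1].

    Assumption: acc_clean is classification accuracy on clean (ID) samples
    and acc_noisy is detection accuracy on noisy (OOD) samples.
    """
    if acc_clean + acc_noisy == 0:
        return 0.0
    return 2 * acc_clean * acc_noisy / (acc_clean + acc_noisy)


def fpr_at_95_tpr(id_scores, ood_scores) -> float:
    """False positive rate on OOD samples when 95% of ID samples are accepted.

    Assumption: a higher score means "more ID-like", and a sample is
    accepted as ID when its score is >= the threshold.
    """
    ranked = sorted(id_scores, reverse=True)
    n = len(ranked)
    keep = (95 * n + 99) // 100      # ceil(0.95 * n) in integer arithmetic
    threshold = ranked[keep - 1]     # lowest score among the kept ID samples
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method that classifies clean samples well but fails to reject noisy ones (or the reverse) scores poorly, which is why it suits a task that cares about both.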

Paper Structure

This paper contains 54 sections, 3 equations, 9 figures, 30 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between TTA, noisy TTA, zero-shot OOD detection, and the proposed zero-shot noisy TTA. Only zero-shot noisy TTA targets both clean and noisy classification accuracy while operating in a task-agnostic, zero-shot manner. ZS-NTTA requires online detection of noisy samples.
  • Figure 2: Performance ranking distribution of five TTA methods across $44$ ID-OOD dataset pairs. For each ID-OOD pair, the methods are ranked by $\text{Acc}_\text{H}$. A rank closer to $1$ denotes better performance, and a larger bottom area reflects superior overall performance. We also evaluate these methods using absolute accuracy in Figure \ref{fig:absolute_acc} in Appendix \ref{app:failure case}.
  • Figure 3: Failure case analysis of Tent \cite{wang2021tent} in ZS-NTTA. (a) and (b) show the score distributions of ZS-CLIP and Tent, respectively, revealing that Tent makes clean and noisy samples harder to distinguish. The horizontal axis shows the OOD score. (c) illustrates the score difference between Tent and ZS-CLIP, indicating that the confidence of noisy samples tends to increase in Tent. ID dataset: CIFAR-10; OOD dataset: SVHN.
  • Figure 4: The impact of clean and noisy samples on the gradients. Note that the gradients of noisy samples are substantially larger in the first and second stages. The model effectively filters out noisy samples in the first stage but gradually struggles to distinguish between clean and noisy samples. ID dataset: CIFAR-10; OOD dataset: SVHN; Batch size: $64$. Please see Figure \ref{app-fig:fail-gradient} for an enlarged view.
  • Figure 5: Overview of the proposed framework. We use the detection results from ZS-CLIP as pseudo-labels to train the Adaptive Noise Detector (AdaND). In the early stage, we use ZS-CLIP directly to separate clean from noisy samples, while in the later stage, we use AdaND instead. The samples predicted as clean are then classified by the text-based classifier. To further handle the clean data stream case, we intentionally inject Gaussian noise as additional noisy samples, so that the detector does not wrongly flag too many clean samples as noisy.
  • ...and 4 more figures
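
The framework described in the Figure 5 caption can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked in the abstract for that). It makes several simplifying assumptions: features are plain Python lists standing in for frozen VLM features, `zero_shot_score` is a hypothetical stand-in for the frozen model's clean-versus-noisy score, the detector is a small logistic regression, Gaussian noise is injected directly in feature space and labeled as noisy, and the switch from the zero-shot decision to the detector happens after a fixed number of warm-up batches. All names and default values are illustrative.

```python
import math
import random


class LinearNoiseDetector:
    """Tiny logistic-regression detector; output near 1 means 'clean'."""

    def __init__(self, dim: int, lr: float = 0.05):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def prob_clean(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        # One SGD step on the binary cross-entropy loss.
        err = self.prob_clean(x) - label
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def adapt_stream(batches, zero_shot_score, threshold, dim,
                 warmup_batches=5, noise_per_batch=2, noise_std=1.0, seed=0):
    """Online loop: the classifier stays frozen, only the detector is trained."""
    rng = random.Random(seed)
    detector = LinearNoiseDetector(dim)
    decisions = []  # per batch: 1 = predicted clean, 0 = predicted noisy
    for step, batch in enumerate(batches):
        # Pseudo-labels come from the frozen zero-shot model.
        pseudo = [1 if zero_shot_score(x) >= threshold else 0 for x in batch]
        if step < warmup_batches:
            decisions.append(pseudo)  # early stage: trust the zero-shot decision
        else:
            decisions.append(         # later stage: use the trained detector
                [1 if detector.prob_clean(x) >= 0.5 else 0 for x in batch])
        for x, y in zip(batch, pseudo):
            detector.update(x, y)
        # Injected Gaussian samples act as extra noisy examples, so the detector
        # always sees a noisy class, even when the stream is entirely clean.
        for _ in range(noise_per_batch):
            noise = [rng.gauss(0.0, noise_std) for _ in range(dim)]
            detector.update(noise, 0)
    return detector, decisions


# Toy stream: "clean" features near (4, 4), "noisy" features near (-4, -4).
toy_batches = [[[4.0, 4.0], [-4.0, -4.0], [4.0, 4.0], [-4.0, -4.0]]
               for _ in range(20)]
toy_detector, toy_decisions = adapt_stream(
    toy_batches, zero_shot_score=lambda x: x[0] + x[1], threshold=0.0, dim=2)
print(toy_decisions[-1], toy_detector.prob_clean([4.0, 4.0]))
```

In this sketch, samples flagged as clean would then be passed to the frozen text-based classifier, which is omitted here because it is never updated.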