Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models

Shambhavi Mishra, Julio Silva-Rodriguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz

TL;DR

This work tackles robust test-time adaptation of vision-language models by reframing adaptation as cross-modal alignment between test-image embeddings and fixed semantic anchors derived from text. The proposed Semantic Anchor Transport (SAT) uses batch-wise Optimal Transport with the Sinkhorn algorithm to produce robust pseudo-labels that guide online adaptation, mitigating error accumulation common in prior approaches. SAT further leverages multi-template distillation to incorporate diverse textual cues without heavy computation, demonstrating strong gains across diverse corruptions and domain shifts while maintaining efficiency. The approach yields state-of-the-art performance on CLIP-based TTA benchmarks and generalizes across backbones and multi-modal architectures, making it a scalable and effective solution for real-world deployment under distribution shifts.

Abstract

Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models can be unreliable under distribution shifts, where their performance degrades significantly. In this work, we investigate how to efficiently use class text information to mitigate the distribution shifts that VLMs encounter during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to preserve the regular structure of the dataset, we formulate the problem as a batch-wise label assignment, which is solved efficiently using Optimal Transport. Our method, Semantic Anchor Transport (SAT), uses these pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT leverages heterogeneous textual cues through a multi-template distillation approach that replicates the multi-view contrastive learning strategies of unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks of varying complexity show that SAT achieves consistent performance gains over recent state-of-the-art methods while remaining computationally efficient.
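The batch-wise label assignment described above can be illustrated with a small Sinkhorn sketch. This is not the authors' implementation: it follows the common Sinkhorn-Knopp normalization used for balanced assignment, and it assumes a uniform class marginal, an entropic regularization `epsilon=0.05`, and three iterations. The function name `sinkhorn_codes` and the random embeddings are hypothetical stand-ins for CLIP image features and text anchors.

```python
import numpy as np


def sinkhorn_codes(logits, epsilon=0.05, n_iters=3):
    """Turn a (B, K) image-to-anchor similarity matrix into (B, K) soft codes.

    Assumption: classes are treated as equally likely within the batch
    (uniform class marginal); the source does not state which prior is used.
    """
    # Subtract the global max before exponentiating for numerical stability.
    Q = np.exp((logits - logits.max()) / epsilon).T  # shape (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        # Normalize rows: each class receives total mass 1/K.
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= K
        # Normalize columns: each sample carries total mass 1/B.
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= B
    Q *= B  # rescale so every sample's code sums to 1
    return Q.T


# Toy stand-ins for CLIP features: 8 test images, 4 class anchors, 16 dims.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_anchors = rng.normal(size=(4, 16))
text_anchors /= np.linalg.norm(text_anchors, axis=1, keepdims=True)

similarities = image_emb @ text_anchors.T   # cosine similarities, (8, 4)
codes = sinkhorn_codes(similarities)        # soft pseudo-labels
pseudo_labels = codes.argmax(axis=1)        # hard labels, if needed
```

Because the assignment is computed jointly over the batch, a class that already attracts many samples is penalized, which is one plausible reason such codes are less prone to collapsing onto a single wrong class than per-sample argmax predictions.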

Paper Structure

This paper contains 27 sections, 9 equations, 5 figures, 14 tables, 3 algorithms.

Figures (5)

  • Figure 1: Error Accumulation. We track examples, drawn from all corruptions of the CIFAR10C dataset, that the initial zero-shot CLIP predictions (at $t=0$) misclassify. Baselines [wangtent, maharana2024texttt, osowiechi2024watt] catastrophically reinforce this error: similarity to the ‘Wrong Class’ (dashed lines) increases while similarity to the ‘True Class’ (solid lines) decreases. In contrast, our method, SAT (orange), is the only one that provides a corrective signal, actively reducing similarity to the wrong class and breaking the cycle of error accumulation.
  • Figure 2: SAT leverages Optimal Transport (with the Sinkhorn algorithm) to yield soft-codes $Q^*_m$ (\ref{eq:individual-codes}). Then, it minimizes the pseudo cross-entropy between $Q^*_m$ and the CLIP predictions $P$ as the unsupervised loss during test-time adaptation (\ref{eq:posterior}). Note that, for each test batch, the model runs one iteration per text template $m$, each yielding its own pseudo-codes $Q^*_m$.
  • Figure 3: Ablation on each component. Results come from adding each element of SAT. In green, performance differences compared to CLIP (numerical values in Appendix \ref{ssec:num-values}).
  • Figure 4: Inference Runtime Comparison. Runtime in seconds for ViT-B/32 TTA methods on an NVIDIA RTX A6000, using a test-time batch of 128 images.
  • Figure 5: Ablation on epsilon values across multiple datasets.
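
The loss described in the Figure 2 caption, a cross-entropy between soft codes $Q^*_m$ and the model predictions $P$, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's code: the helper names `log_softmax` and `pseudo_cross_entropy` are hypothetical, the codes are hand-written toy values instead of Sinkhorn outputs, and no CLIP temperature scaling is applied.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def pseudo_cross_entropy(codes, logits):
    """Mean cross-entropy between soft codes Q (targets) and softmax(logits) = P."""
    return float(-(codes * log_softmax(logits)).sum(axis=1).mean())


# Toy soft codes for two samples and two classes (stand-ins for Q*_m).
codes = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

# Logits whose softmax reproduces the codes exactly, and a swapped variant.
matched_logits = np.log(codes)
swapped_logits = matched_logits[:, ::-1]

loss_matched = pseudo_cross_entropy(codes, matched_logits)
loss_swapped = pseudo_cross_entropy(codes, swapped_logits)

# One loss value per text template: here the same toy codes are reused for
# two hypothetical templates purely to show the per-template loop.
per_template_losses = [
    pseudo_cross_entropy(q, z)
    for q, z in [(codes, matched_logits), (codes, swapped_logits)]
]
```

The loss is smallest when the predictions agree with the codes, where it equals the entropy of the codes, and grows as the predictions drift away from them. In an actual adaptation loop each per-template loss would drive one gradient step on the adapted parameters, which requires an autodiff framework and is omitted here.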