Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models

Shambhavi Mishra; Julio Silva-Rodriguez; Ismail Ben Ayed; Marco Pedersoli; Jose Dolz

TL;DR

This work tackles robust test-time adaptation of vision-language models by reframing adaptation as cross-modal alignment between test-image embeddings and fixed semantic anchors derived from text. The proposed Semantic Anchor Transport (SAT) uses batch-wise Optimal Transport with the Sinkhorn algorithm to produce robust pseudo-labels that guide online adaptation, mitigating error accumulation common in prior approaches. SAT further leverages multi-template distillation to incorporate diverse textual cues without heavy computation, demonstrating strong gains across diverse corruptions and domain shifts while maintaining efficiency. The approach yields state-of-the-art performance on CLIP-based TTA benchmarks and generalizes across backbones and multi-modal architectures, making it a scalable and effective solution for real-world deployment under distribution shifts.

Abstract

Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models can be unreliable under distribution shifts, where their performance degrades significantly. In this work, we investigate how to efficiently use class text information to mitigate the distribution shifts that VLMs encounter during inference. In particular, we propose generating pseudo-labels for noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. To preserve the underlying structure of the dataset, we formulate the problem as a batch-wise label assignment, which is solved efficiently using Optimal Transport. Our method, Semantic Anchor Transport (SAT), uses these pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. SAT further leverages heterogeneous textual cues through a multi-template distillation approach that mirrors the multi-view contrastive learning strategies of unsupervised representation learning, without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks of varying complexity show that SAT achieves consistent performance gains over recent state-of-the-art methods while remaining computationally efficient.
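To make the batch-wise label assignment more concrete, the following is a minimal sketch of how soft pseudo-labels could be obtained from entropic Optimal Transport with Sinkhorn iterations, the solver named in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the function name `sinkhorn_pseudo_labels`, the cosine-similarity affinity, the uniform marginals over samples and classes, and the values of `epsilon` and `n_iters` are all hypothetical choices, and the toy 2-D vectors stand in for real image and text embeddings. Plain Python is used only to keep the example self-contained.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sinkhorn_pseudo_labels(image_embs, text_anchors, epsilon=0.1, n_iters=50):
    """Soft pseudo-labels for one batch via entropic Optimal Transport.

    ASSUMPTIONS (not taken from the source): cosine similarity as the
    affinity, uniform marginals over samples and classes, and the default
    values of `epsilon` and `n_iters`.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(t) for t in text_anchors]
    B, K = len(imgs), len(txts)

    # Gibbs kernel: exp(similarity / epsilon) for every (image, anchor) pair.
    Q = [[math.exp(sum(a * b for a, b in zip(i, t)) / epsilon) for t in txts]
         for i in imgs]

    for _ in range(n_iters):
        # Column step: each class anchor receives total mass 1/K.
        col_sums = [sum(Q[i][k] for i in range(B)) for k in range(K)]
        Q = [[Q[i][k] / (col_sums[k] * K) for k in range(K)] for i in range(B)]
        # Row step: each test sample carries total mass 1/B.
        row_sums = [sum(row) for row in Q]
        Q = [[q / (row_sums[i] * B) for q in row] for i, row in enumerate(Q)]

    # Rescale so that each row is a probability distribution over classes.
    return [[q * B for q in row] for row in Q]


# Toy batch: two samples near each of two fixed text anchors.
image_embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
text_anchors = [[1.0, 0.0], [0.0, 1.0]]

labels = sinkhorn_pseudo_labels(image_embs, text_anchors)
hard = [max(range(len(row)), key=row.__getitem__) for row in labels]
print(hard)  # [0, 0, 1, 1]
```

In this sketch the column normalization is what distinguishes the assignment from a per-sample softmax: the batch as a whole is spread across the class anchors, so a single dominant class cannot absorb every sample. Whether SAT uses uniform class marginals or another prior is not stated in the abstract.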
