Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation
Yushun Tang, Shuoshuo Chen, Zhehan Kan, Yi Zhang, Qinghai Guo, Zhihai He
TL;DR
This paper tackles the cross-domain performance drop that deep networks suffer under distribution shift. It proposes a Visual Conditioning Token (VCT), learned at the first transformer layer of a Vision Transformer (ViT), that progressively removes domain-shift perturbations during fully test-time adaptation (TTA). The VCT is learned with a bi-level framework comprising a long-term domain-specific token (DS-VCT) and a short-term instance-specific token (IS-VCT); both are integrated with the input patch embeddings and updated online via reliable entropy minimization and sharpness-aware entropy minimization. Experiments on ImageNet-C, ImageNet-R, VisDA-2021, and Office-Home show consistent improvements over state-of-the-art fully TTA methods, with gains of up to around 1.9 percentage points in challenging settings and robustness to small-batch inference. The work advances transformer-based TTA by using domain-conditioning tokens to capture both global domain priors and local instance variability, offering a practical strategy for robust deployment under real-world distribution shifts.
Abstract
Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage, in order to address the cross-domain performance degradation of deep neural networks. This work is based on the following finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation. This learned token, when combined with the input image patch embeddings, is able to gradually remove the domain-specific information from the feature representations of input samples during the transformer encoding process, thereby significantly improving the test-time adaptation performance of the source model across different domains. We refer to this class token as the visual conditioning token (VCT). To successfully learn the VCT, we propose a bi-level learning approach that captures the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics. Experimental results on benchmark datasets demonstrate that our proposed bi-level visual conditioning token learning method achieves significantly improved test-time adaptation performance, by up to 1.9%.
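The TL;DR states that the tokens are updated online via reliable entropy minimization. As a rough illustration of that kind of objective only, the sketch below computes the prediction entropy of each test sample and averages it over the samples whose entropy falls below a threshold. The function names, the toy logits, and the threshold value 0.4 * ln(C) are assumptions made for this example and are not taken from the paper; in an actual adaptation loop, a loss of this form would be backpropagated into the learnable token rather than merely evaluated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reliable_entropy_loss(batch_logits, threshold):
    """Mean entropy over samples whose entropy is below `threshold`.

    Returns (loss, kept_indices). The loss is 0.0 when no sample is kept.
    This is an illustrative filter, not the paper's exact criterion.
    """
    entropies = [entropy(softmax(row)) for row in batch_logits]
    kept = [i for i, h in enumerate(entropies) if h < threshold]
    if not kept:
        return 0.0, kept
    loss = sum(entropies[i] for i in kept) / len(kept)
    return loss, kept


# Toy batch with 3 classes: one confident and one near-uniform prediction.
num_classes = 3
batch_logits = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.1]]
threshold = 0.4 * math.log(num_classes)  # assumed margin, for illustration
loss, kept = reliable_entropy_loss(batch_logits, threshold)
print(kept, round(loss, 4))
```

In this toy batch only the confident first sample passes the filter, so the near-uniform second sample does not contribute to the loss. Filtering in this way is meant to keep high-entropy, likely unreliable predictions from driving the online update.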
