Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation

Yushun Tang, Shuoshuo Chen, Zhehan Kan, Yi Zhang, Qinghai Guo, Zhihai He

TL;DR

This paper tackles the cross-domain performance drop under distribution shift by proposing a Visual Conditioning Token (VCT), learned at the first transformer layer of a Vision Transformer (ViT), that progressively removes domain-shift perturbations during fully test-time adaptation (TTA). It introduces a bi-level learning framework comprising a long-term domain-specific token (DS-VCT) and a short-term instance-specific token (IS-VCT), which are combined with the input patch embeddings and updated online via reliable entropy minimization and sharpness-aware entropy minimization. Empirical results on ImageNet-C, ImageNet-R, VisDA-2021, and Office-Home show consistent improvements over state-of-the-art fully TTA methods, with gains of up to 1.9 percentage points in challenging settings and robustness to small-batch inference. The work advances transformer-based TTA by using domain-conditioning tokens to capture both global domain priors and local instance variability, offering a practical strategy for robust deployment under real-world distribution shifts.

Abstract

Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. This work is based on the following interesting finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation. This learned token, when combined with input image patch embeddings, is able to gradually remove the domain-specific information from the feature representations of input samples during the transformer encoding process, thereby significantly improving the test-time adaptation performance of the source model across different domains. We refer to this class token as visual conditioning token (VCT). To successfully learn the VCT, we propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics. Experimental results on the benchmark datasets demonstrate that our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.
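
The summary above mentions reliable entropy minimization as one of the update signals. As a minimal sketch of that idea, the snippet below computes the prediction entropy for each sample in a batch and averages it only over samples whose entropy falls below a threshold. The function names, the `margin_ratio` parameter, and its default of 0.4 times ln(C) are illustrative assumptions, not details taken from the paper; in practice the loss would be differentiated with respect to the conditioning token by an autodiff framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reliable_entropy_loss(batch_logits, margin_ratio=0.4):
    """Mean entropy over samples judged reliable (low entropy).

    margin_ratio is a hypothetical hyperparameter: a sample is kept
    only if its entropy is below margin_ratio * ln(num_classes).
    Returns (loss, number_of_kept_samples).
    """
    num_classes = len(batch_logits[0])
    threshold = margin_ratio * math.log(num_classes)
    entropies = [entropy(softmax(z)) for z in batch_logits]
    kept = [h for h in entropies if h < threshold]
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, len(kept)

# One confident prediction and one uniform (maximally uncertain) prediction.
loss, n_kept = reliable_entropy_loss([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(n_kept, loss)
```

Only the confident sample contributes to the loss here; the uniform prediction has entropy ln(3), which exceeds the assumed threshold and is therefore skipped.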

Paper Structure

This paper contains 17 sections, 8 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An illustration of our proposed bi-level visual conditioning token learning. (b): The adaptation process of the domain-specific visual conditioning token (DS-VCT), which aims to learn domain-specific information for all samples in the same domain. (c): The adaptation process of the instance-specific visual conditioning token (IS-VCT), which aims to learn instance-specific information for each target sample based on DS-VCT.
  • Figure 2: An overview of the proposed VCT method. During inference in the target domain, the class token, which consists of the domain-specific visual conditioning token (DS-VCT) and the instance-specific visual conditioning token (IS-VCT), is updated before a prediction is made for each mini-batch of test samples. The IS-VCT and its gradient are reset after each prediction (Left). The gradient flow in the multi-layer transformer encoder (Center). The details of each encoder layer (Right).
  • Figure 3: Visualization of the VCT during adaptation for different domains of the ImageNet-C dataset. (a): The t-SNE of the VCT for different domains during test-time adaptation, which shows that the VCT can learn domain-specific information; (b): The VCT of two classes for different domains during the adaptation process. For the same domain, the VCTs cluster in similar regions, which shows that the VCT remains relatively invariant across categories.
  • Figure 4: Token comparison of the Source, VCT, and Oracle methods on the Gaussian Noise and Motion Blur domains of the ImageNet-C dataset. Our learned VCT is close to the Oracle token, which is learned with label supervision.
  • Figure 5: The cosine similarity between the label-supervised Oracle token and our distinct token configurations.
  • ...and 1 more figure
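
The bi-level update described in the Figure 1 and Figure 2 captions can be illustrated with a toy sketch: a long-term DS-VCT persists across mini-batches, while a short-term IS-VCT is re-initialized for every mini-batch and discarded after the prediction. Several things here are assumptions made for illustration only: the two tokens are combined by addition, both are updated from the same gradient in a single step, the learning rates `lr_ds` and `lr_is` are arbitrary, and `toy_grad` is a hypothetical stand-in (the gradient of a simple quadratic) for the entropy-based losses the paper uses.

```python
def adapt_stream(batches, grad_fn, dim, lr_ds=0.1, lr_is=0.5):
    """Toy bi-level token adaptation over a stream of mini-batches.

    grad_fn(token, batch) is a hypothetical callback returning the
    gradient of some adaptation loss with respect to the class token.
    Returns the final DS-VCT and the token used for each prediction.
    """
    ds_vct = [0.0] * dim  # long-term token, kept across batches
    used_tokens = []
    for batch in batches:
        is_vct = [0.0] * dim  # short-term token, reset for every batch
        token = [d + i for d, i in zip(ds_vct, is_vct)]
        grad = grad_fn(token, batch)
        # Both tokens take one gradient step before the prediction.
        ds_vct = [d - lr_ds * g for d, g in zip(ds_vct, grad)]
        is_vct = [i - lr_is * g for i, g in zip(is_vct, grad)]
        used_tokens.append([d + i for d, i in zip(ds_vct, is_vct)])
        # is_vct goes out of scope here, i.e. it is reset after prediction.
    return ds_vct, used_tokens

def toy_grad(token, batch):
    """Gradient of 0.5 * ||token - mean(batch)||^2 (illustrative only)."""
    dim = len(token)
    mean = [sum(x[k] for x in batch) / len(batch) for k in range(dim)]
    return [t - m for t, m in zip(token, mean)]

batches = [[[1.0, 1.0]], [[1.0, 1.0]]]
ds, used = adapt_stream(batches, toy_grad, dim=2)
print(ds, used)
```

In this sketch the DS-VCT accumulates small steps across the stream, while each prediction also benefits from a larger, batch-local IS-VCT correction that does not carry over to the next batch.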