Domain-Conditioned Transformer for Fully Test-time Adaptation

Yushun Tang, Shuoshuo Chen, Jiyuan Jia, Yi Zhang, Zhihai He

TL;DR

This work proposes a new structure for the self-attention modules in the transformer that incorporates three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module and finds that these domain conditioners are able to gradually remove the impact of domain shift and largely recover the original self-attention profile.

Abstract

Fully test-time adaptation aims to adapt a network model online based on sequential analysis of input samples during the inference stage. We observe that, when applying a transformer network model to a new domain, the self-attention profiles of image samples in the target domain deviate significantly from those in the source domain, which results in large performance degradation during domain changes. To address this important issue, we propose a new structure for the self-attention modules in the transformer. Specifically, we incorporate three domain-conditioning vectors, called domain conditioners, into the query, key, and value components of the self-attention module. We learn a network to generate these three domain conditioners from the class token at each transformer network layer. We find that, during fully online test-time adaptation, these domain conditioners at each transformer network layer are able to gradually remove the impact of domain shift and largely recover the original self-attention profile. Our extensive experimental results demonstrate that the proposed domain-conditioned transformer significantly improves the online fully test-time domain adaptation performance and outperforms existing state-of-the-art methods by large margins.
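To make the idea of conditioning the query, key, and value on the class token more concrete, here is a minimal single-head sketch in plain Python. It rests on assumptions that the abstract does not confirm: the conditioners are added to the projected queries, keys, and values, and each conditioner is produced by one linear map of the class token. The names `domain_conditioned_attention`, `gen_q`, `gen_k`, and `gen_v` are hypothetical, and the authors' generator network and the way it enters the attention may differ.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def domain_conditioned_attention(tokens, Wq, Wk, Wv, gen_q, gen_k, gen_v):
    """Single-head self-attention with additive domain conditioners.

    tokens[0] is taken to be the class token. ASSUMPTION: each conditioner
    is a linear function of the class token and is added to every token's
    projected query, key, or value.
    """
    cls = tokens[0]
    # Hypothetical generator: one linear map per conditioner.
    cq, ck, cv = matvec(gen_q, cls), matvec(gen_k, cls), matvec(gen_v, cls)
    q = [add(matvec(Wq, t), cq) for t in tokens]
    k = [add(matvec(Wk, t), ck) for t in tokens]
    v = [add(matvec(Wv, t), cv) for t in tokens]
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product scores of this query against all keys.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Attention-weighted sum of the conditioned values.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out

# Toy usage: three 2-dimensional tokens, identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
gen_v = [[0.5, 0.0], [0.0, -1.0]]
base = domain_conditioned_attention(tokens, eye, eye, eye, zero, zero, zero)
shifted = domain_conditioned_attention(tokens, eye, eye, eye, zero, zero, gen_v)
```

With all generators set to zero, the function reduces to ordinary self-attention, so the source-domain behaviour is the starting point. Under this additive assumption, only the generator weights (and, per the Figure 3 caption, the LN layers) would be updated at test time.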

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Visualization of output class tokens across various layers of our adapted ViT-B/16 network on the ImageNet-C dataset. Different corruptions represent different domains. In layer 1, the features are domain-separable but class-inseparable because of domain shift: the distance between domains is large and the distance between classes is small. Our DCT method effectively mitigates the influence of domain shift over successive layers. Consequently, the domain distance decreases while the class distance increases, making the features domain-inseparable yet class-separable across the layers of the Domain-Conditioned Transformer.
  • Figure 2: Size of the attended area by transformer network depth. Each dot on the figure represents the mean attention distance calculated across 128 example images, considering all heads at a specific layer.
  • Figure 3: An overview of the proposed DCT method. Left: the domain-conditioned transformer. Right: the details of the self-attention head in each layer. During inference in the target domain, the domain-conditioner generator $\Phi^{l}$ and the LN layers are updated on each mini-batch of test samples before a prediction is made.
  • Figure 4: Visualization of the domain conditioners after the entire adaptation process for different domains in ImageNet-C from the first vision transformer layers.
  • Figure 5: Visualization of output class tokens from different vision transformer layers. The first 5 plots show the features from various layers of our DCT, and the last plot shows the features of the source model for comparison.
  • ...and 2 more figures
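
The Figure 2 caption refers to a mean attention distance per layer but does not spell out how it is computed. The sketch below uses one common definition as an assumption: for each query patch, the distances to all key patches are averaged with the attention weights, and the result is averaged over query patches. The function name, the exclusion of the class token, and the `patch_size` default of 16 (chosen to match ViT-B/16) are illustrative choices, not details confirmed by the source.

```python
import math

def mean_attention_distance(attn, grid, patch_size=16):
    """Attention-weighted mean distance (in pixels) for one attention head.

    attn: n x n matrix with n = grid * grid, where attn[i][j] is the weight
    that query patch i puts on key patch j (rows sum to 1).
    ASSUMPTION: the class token has already been removed from attn.
    """
    n = grid * grid
    # (row, col) position of each patch on the image grid.
    coords = [(i // grid, i % grid) for i in range(n)]
    total = 0.0
    for qi in range(n):
        for ki in range(n):
            dy = coords[qi][0] - coords[ki][0]
            dx = coords[qi][1] - coords[ki][1]
            total += attn[qi][ki] * patch_size * math.hypot(dx, dy)
    return total / n

# Toy usage on a 2 x 2 patch grid.
local = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
uniform = [[0.25] * 4 for _ in range(4)]
d_local = mean_attention_distance(local, grid=2, patch_size=1)
d_uniform = mean_attention_distance(uniform, grid=2, patch_size=1)
```

A head that attends only to its own patch gets a distance of zero, while uniform attention gives a larger value. To reproduce a per-layer dot as described in the caption, this quantity would further be averaged over the heads of that layer and over the example images.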