
3rd Place Solution for VisDA 2021 Challenge -- Universally Domain Adaptive Image Recognition

Haojin Liao, Xiaolin Song, Sicheng Zhao, Shanghang Zhang, Xiangyu Yue, Xingxu Yao, Yueming Zhang, Tengfei Xing, Pengfei Xu, Qiang Wang

TL;DR

This work tackles universal domain adaptation (UniDA) in the VisDA 2021 Challenge by combining a Transformer-based VOLO backbone with OVANet-inspired open-set handling and an adversarial domain discriminator. Key contributions include integrating VOLO-D3 as the feature extractor, adopting Token Labeling with VOLO-compatible augmentations, expanding the near-negative open-set classifiers, and introducing a gradient-reversal domain discriminator to align source and target distributions for known classes. With a two-stage training regime and 5-crop inference, the approach places 3rd on the VisDA 2021 leaderboard with 48.49% ACC and 70.8% AUROC. The results indicate that Transformer-based feature representations and explicit distribution alignment are effective for open-world domain adaptation in large-scale, multi-class settings.

Abstract

The Visual Domain Adaptation (VisDA) 2021 Challenge calls for unsupervised domain adaptation (UDA) methods that can deal with both input distribution shift and label set variance between the source and target domains. In this report, we introduce a universal domain adaptation (UniDA) method that aggregates several popular feature extraction and domain adaptation schemes. First, we use VOLO, a Transformer-based architecture with state-of-the-art performance on several visual tasks, as the backbone to extract effective feature representations. Second, we modify the open-set classifier of OVANet to recognize the unknown class with competitive accuracy and robustness. As shown on the leaderboard, our proposed UniDA method ranks 3rd in the VisDA 2021 Challenge with 48.49% ACC and 70.8% AUROC.

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of our proposed solution, which has four cooperating parts. First, the VOLO backbone extracts features from both source and target images. These features are then sent separately to the closed-set classifier, the open-set classifier, and the domain discriminator. The closed-set classifier identifies the most likely known class, while the open-set classifier determines whether the sample is known or unknown. The domain discriminator helps the backbone match the feature distributions of the source and target data for the known classes through adversarial training.
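
The way the closed-set and open-set classifiers in Figure 1 cooperate at inference time can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-vs-all open-set head that emits a (known, unknown) logit pair per class, in the spirit of OVANet, and the names `predict`, `UNKNOWN`, and the 0.5 threshold are hypothetical choices made for this example.

```python
import math

UNKNOWN = -1  # hypothetical label used here for the "unknown" class


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(closed_logits, open_logits, threshold=0.5):
    """Combine the two heads for one sample.

    closed_logits: K scores from the closed-set classifier.
    open_logits:   K pairs (known_score, unknown_score), one pair per
                   known class (assumed one-vs-all layout).
    Returns (label, p_unknown), where label is a class index or UNKNOWN.
    """
    # Step 1: the closed-set classifier proposes the most likely known class.
    probs = softmax(closed_logits)
    k = max(range(len(probs)), key=probs.__getitem__)

    # Step 2: the open-set classifier for that class decides known vs. unknown.
    p_known, p_unknown = softmax(list(open_logits[k]))
    label = k if p_known >= threshold else UNKNOWN
    return label, p_unknown


if __name__ == "__main__":
    open_head = [(3.0, 0.0), (0.0, 3.0), (0.0, 3.0)]
    print(predict([2.0, 0.5, 0.1], open_head))  # class 0 is accepted as known
    print(predict([0.1, 2.0, 0.5], open_head))  # class 1 is rejected as unknown
```

In this sketch the returned `p_unknown` can serve as the per-sample score for an AUROC-style evaluation of unknown detection, while `label` feeds the accuracy computation.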