Visual Prompt Tuning for Test-time Domain Adaptation

Yunhe Gao, Xingjian Shi, Yi Zhu, Hao Wang, Zhiqiang Tang, Xiong Zhou, Mu Li, Dimitris N. Metaxas

TL;DR

This work tackles test-time domain adaptation without access to source data by freezing a ViT backbone and learning a small set of visual prompts. It combines memory-bank-based pseudo-labeling with a hierarchical self-supervised regularization designed for prompts, enabling robust adaptation with few tunable parameters. Across VisDA-C, ImageNet-C, and DomainNet-126, DePT achieves state-of-the-art results and strong data efficiency, remaining competitive with only 1% of the target data, and it extends to online and multi-source settings. The approach offers a practical, parameter-efficient path for deploying domain-adaptive vision models in real-world scenarios.

Abstract

Models should be able to adapt to unseen data at test time to avoid performance drops caused by inevitable distribution shifts in real-world deployment scenarios. In this work, we tackle the practical yet challenging test-time adaptation (TTA) problem, where a model adapts to the target domain without accessing the source data. We propose a simple recipe called Data-efficient Prompt Tuning (DePT) with two key ingredients. First, DePT plugs visual prompts into the vision Transformer and tunes only these source-initialized prompts during adaptation. We find that such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. Second, DePT bootstraps the source representation to the target domain by memory bank-based online pseudo-labeling. A hierarchical self-supervised regularization specially designed for prompts is jointly optimized to alleviate error accumulation during self-training. With far fewer tunable parameters, DePT demonstrates not only state-of-the-art performance on the major adaptation benchmarks VisDA-C, ImageNet-C, and DomainNet-126, but also superior data efficiency, i.e., adaptation with only 1% or 10% of the data without much performance degradation compared to 100% of the data. In addition, DePT extends readily to online and multi-source TTA settings.
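The first ingredient, prepending tunable prompts to a frozen backbone, can be illustrated with a minimal Python sketch. It is not the authors' implementation: `mean_mixing_stage` is a hypothetical stand-in for a frozen Transformer stage, all function and variable names are invented for illustration, and discarding each stage's prompt outputs before the next stage is an assumption rather than something the abstract states.

```python
from typing import Callable, List

Token = List[float]
Stage = Callable[[List[Token]], List[Token]]


def mean_mixing_stage(tokens: List[Token]) -> List[Token]:
    """Hypothetical frozen stage: adds the mean token to every token.

    This is only a crude stand-in for attention, chosen so that prompt
    tokens influence the outputs of the image tokens.
    """
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in tokens]


def forward_with_stage_prompts(
    tokens: List[Token],
    stages: List[Stage],
    stage_prompts: List[List[Token]],
) -> List[Token]:
    """Prepend each stage's prompts to its input, then run the frozen stage."""
    assert len(stages) == len(stage_prompts), "one prompt set per stage"
    for stage, prompts in zip(stages, stage_prompts):
        out = stage(prompts + tokens)
        # Assumption: prompt outputs are dropped, so every stage sees
        # only its own prompts plus the image tokens.
        tokens = out[len(prompts):]
    return tokens


# Usage: two image tokens, two frozen stages, one prompt token per stage.
image_tokens = [[1.0, 1.0], [3.0, 3.0]]
frozen_stages = [mean_mixing_stage, mean_mixing_stage]
prompts_per_stage = [[[0.0, 0.0]], [[0.5, -0.5]]]
features = forward_with_stage_prompts(image_tokens, frozen_stages, prompts_per_stage)
print(len(features), "output tokens of dimension", len(features[0]))
```

In this sketch only `prompts_per_stage` would be updated during adaptation; the stage functions carry no tunable state, which mirrors the frozen backbone described above.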

Paper Structure (24 sections, 11 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: The test-time adaptation performance of different methods with respect to the data ratio on the VisDA dataset. The number in the legend denotes the number of tunable parameters. DePT-G outperforms the previous SOTA, AdaContrast, on 100% of the target data while tuning only 0.19% of the parameters. DePT's advantage is larger in low-data settings.
  • Figure 2: The overview of DePT. (A) We split ViT into multiple stages and prepend prompts to the input of each stage. The prompts, along with the backbone, are initialized with labeled source domain data. Only prompts and the classification head (in red) are finetuned during adaptation, while the backbone is frozen (in blue). (B) The proposed adaptation framework. The pseudo labels of target data are first predicted and then refined by memory bank-based soft voting for self-training. A hierarchical self-supervised objective is proposed to improve target representation and alleviate error accumulation of self-training.
  • Figure 3: (A) Performance comparison on the VisDA-C dataset with different data ratios for TTA. DePT degrades less as the amount of training data decreases. (B) t-SNE visualization of the DePT-G model before and after adaptation on the VisDA-C dataset. Different colors denote different classes.
  • Figure 4: Average evaluation accuracy vs. training steps with 100% and 1% of the data on the VisDA-C dataset.
  • Figure 5: Our method extends easily to the multi-source domain adaptation setting, where only the corresponding phase needs to be modified.
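
The pseudo-label refinement described in the Figure 2 caption (predict first, then refine by memory bank-based soft voting) can be sketched as follows. This is a hedged illustration, not the paper's code: the use of cosine similarity, the neighbor count `k`, and every name below are assumptions introduced for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def refine_pseudo_label(
    query_feature: Sequence[float],
    bank_features: List[Sequence[float]],
    bank_probs: List[Sequence[float]],
    k: int = 3,
) -> Tuple[int, List[float]]:
    """Soft voting: average the stored class probabilities of the k most
    similar memory-bank entries and return (argmax label, voted probabilities).

    Assumption: neighbors are ranked by cosine similarity of features.
    """
    ranked = sorted(
        range(len(bank_features)),
        key=lambda i: cosine(query_feature, bank_features[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    num_classes = len(bank_probs[0])
    voted = [
        sum(bank_probs[i][c] for i in neighbors) / len(neighbors)
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=lambda c: voted[c])
    return label, voted


# Usage with a toy memory bank of four entries and two classes.
bank_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bank_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
label, voted = refine_pseudo_label([1.0, 0.05], bank_features, bank_probs, k=2)
print("refined label:", label)
```

Averaging the neighbors' probability vectors, rather than counting their hard labels, is what makes the vote "soft": confident neighbors contribute more to the refined label than uncertain ones.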