VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Duy Nguyen, Dat Nguyen

TL;DR

VirDA tackles the inefficiency of traditional unsupervised domain adaptation by reusing a frozen backbone and prepending domain-specific visual reprogramming layers whose visual prompts shift the style and texture of the input toward a shared representation. It couples these prompts with domain-specific classifiers and a dual-objective training regime that enforces inter-domain alignment and intra-domain robustness without altering backbone weights. Empirical results across Digits, Office-31, and Office-Home demonstrate competitive accuracy with far fewer trainable parameters and reduced storage, outperforming several PEFT and full fine-tuning baselines in many settings. The work highlights the practicality of texture-aware prompting for cross-domain transfer and lays groundwork for extending the approach to other vision tasks.

Abstract

Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, so the number of trainable parameters and the storage footprint grow linearly with each new pair, and the well-trained backbone parameters cannot be reused. Inspired by recent findings that existing backbones have textural biases, we propose VirDA, which exploits domain-specific textural bias for domain adaptation via visual reprogramming. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.
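The core idea in the abstract (a small domain-specific layer in front of a backbone that is never updated) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the additive `image + mask * prompt` form, the class name `VisualReprogrammingLayer`, and the stand-in `frozen_backbone` function are hypothetical and are not the authors' architecture or training code.

```python
import random


def frozen_backbone(image):
    """Stand-in for a pretrained backbone (hypothetical): it has no trainable
    state, so every domain shares exactly the same function."""
    # Toy "feature": the mean intensity of each image row.
    return [sum(row) / len(row) for row in image]


class VisualReprogrammingLayer:
    """Toy domain-specific layer (assumed form): a prompt, scaled by a mask,
    is added to the input image before the frozen backbone sees it."""

    def __init__(self, height, width, seed=0):
        rng = random.Random(seed)
        # These two grids are the only per-domain parameters in this sketch.
        self.prompt = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                       for _ in range(height)]
        self.mask = [[1.0] * width for _ in range(height)]

    def num_parameters(self):
        return sum(len(row) for row in self.prompt) + sum(len(row) for row in self.mask)

    def __call__(self, image):
        return [[x + m * p for x, m, p in zip(x_row, m_row, p_row)]
                for x_row, m_row, p_row in zip(image, self.mask, self.prompt)]


# One layer per domain; the backbone is shared and untouched.
source_layer = VisualReprogrammingLayer(4, 4, seed=1)
target_layer = VisualReprogrammingLayer(4, 4, seed=2)

image = [[0.5] * 4 for _ in range(4)]
source_features = frozen_backbone(source_layer(image))
target_features = frozen_backbone(target_layer(image))
```

In this sketch, adapting to a new domain means creating and training another `VisualReprogrammingLayer`; `frozen_backbone` is reused as-is, which mirrors the reuse property the abstract describes.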

Paper Structure

This paper contains 12 sections, 14 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Our method outperforms other parameter-efficient fine-tuning methods that use CLIP as the backbone (e.g., PDA and MaPLe), as well as methods that require full fine-tuning (e.g., FixBi and CDTrans), at minimal computation cost. Moreover, VirDA requires only 1.7% of the trainable parameters (1.5M vs. 86.6M) while sacrificing 2.2% accuracy compared to the SoTA method.
  • Figure 2: The overall pipeline of VirDA.
  • Figure 3: Visualization of samples and reprogramming masks for selected classes when VirDA transfers from the source domain (Product) to the mild target domain (Realworld) and the hard target domain (Clipart). The classes shown are those on which our method performs worst and best on Clipart, as indicated by the class-wise accuracy of VirDA.
  • Figure 4: Visualization of the original image and the reprogramming mask before (upper row) and after (lower row) the UDA task on Rw$\to$Pr. The source domain masks focus on encoding the surrounding areas, while the target domain masks highlight the main object.
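The parameter ratio quoted in the Figure 1 caption, and the linear storage growth described in the abstract, can be checked with a few lines of arithmetic. The two parameter counts come from the caption; the number of domain pairs and the assumption that the shared frozen backbone is about as large as the 86.6M fully fine-tuned model are illustrative choices, not figures from the paper.

```python
VIRDA_TRAINABLE = 1.5e6          # from the Figure 1 caption
FULL_FINETUNE_TRAINABLE = 86.6e6  # from the Figure 1 caption

ratio = VIRDA_TRAINABLE / FULL_FINETUNE_TRAINABLE
print(f"trainable-parameter ratio: {ratio:.1%}")  # prints 1.7%


def stored_parameters(num_pairs, per_pair, shared=0.0):
    """Parameters kept on disk: one shared block plus one block per domain pair."""
    return shared + num_pairs * per_pair


NUM_PAIRS = 6  # illustrative number of source-target pairs

# Full fine-tuning: a complete model copy is stored for every pair.
full_finetune_total = stored_parameters(NUM_PAIRS, FULL_FINETUNE_TRAINABLE)

# Frozen-backbone reuse: one shared backbone (size assumed, see lead-in)
# plus a small per-pair block.
reuse_total = stored_parameters(NUM_PAIRS, VIRDA_TRAINABLE,
                                shared=FULL_FINETUNE_TRAINABLE)

print(f"full fine-tuning: {full_finetune_total / 1e6:.1f}M parameters stored")
print(f"shared backbone:  {reuse_total / 1e6:.1f}M parameters stored")
```

Both totals grow linearly in the number of pairs, but the per-pair slope differs by the ratio computed above, so the gap widens as more domain pairs are added.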