OT-VP: Optimal Transport-guided Visual Prompting for Test-Time Adaptation

Yunbei Zhang, Akshay Mehra, Jihun Hamm

TL;DR

Optimal Transport-guided Test-Time Visual Prompting (OT-VP) uses prompt learning at test time to align the target and source domains without accessing the training process or altering pre-trained model parameters.

Abstract

Vision Transformers (ViTs) have demonstrated remarkable capabilities in learning representations, but their performance is compromised when applied to unseen domains. Previous methods either engage in prompt learning during the training phase or modify model parameters at test time through entropy minimization. The former often overlooks unlabeled target data, while the latter does not fully address domain shifts. In this work, our approach, Optimal Transport-guided Test-Time Visual Prompting (OT-VP), handles these problems by leveraging prompt learning at test time to align the target and source domains without accessing the training process or altering pre-trained model parameters. This method involves learning a universal visual prompt for the target domain by optimizing the Optimal Transport distance. OT-VP, with only four learned prompt tokens, exceeds state-of-the-art performance across three stylistic datasets (PACS, VLCS, and OfficeHome) and one corrupted dataset (ImageNet-C). Additionally, OT-VP operates efficiently in terms of both memory and computation, and can be extended to online settings.

Paper Structure (15 sections, 8 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Motivation of our approach. (a) An ERM model trained on the source domain struggles to adapt to the target domain due to domain shifts. (b) Our method (OT-VP) optimizes a visual prompt by minimizing the Optimal Transport distance to align the target distribution (indicated as an ellipse) with the source distribution without changing the decision boundary.
  • Figure 2: An overview of our proposed OT-VP method. At test time, unlabeled target data are processed through a frozen pre-trained ViT model with only the prompt tokens (indicated in red) being trainable. This generates target representations ($z^t$) and pseudo-labels ($\hat{y}^t$). We then align these with actual source labels ($y^s$) and offline-computed (the grey shadowed area) source representations ($z^s$) via Optimal Transport (OT) distance. The visual prompts are iteratively optimized based on this distance to align the source and target domain data more closely.
  • Figure 3: t-SNE visualization showcasing the impact of OT-VP. The figures display the representation space before and after the application of OT-VP for A $\rightarrow$ C in the PACS dataset with the pre-trained ViT encoder. Different numbers represent distinct class labels. (a) The initial state from ERM, shown in the left image: the target data points are not only distant from the source but also exhibit considerable class overlap, especially within the central region enclosed by the ellipse. This misalignment corresponds to an accuracy of $63.5\%$, an OT distance of $29.1$, and a prediction entropy of $0.54$. (b) After employing OT-VP, the right image shows that the target representations become more distinct and well-separated, with classes from source and target better aligned. The target data have shifted closer to the corresponding source representations, improving accuracy to $81.4\%$ (an increase of $17.9$ percentage points) and reducing the OT distance and prediction entropy to $25.8$ and $0.27$ respectively.
  • Figure 4: (a) Influence of hyperparameter $\lambda$ on ImageNet-C. Accuracy declines with smaller $\lambda$ values but stabilizes when $\lambda$ is large. These trends reveal the pivotal role of $\lambda$ in preventing cross-class transport and its impact on overfitting, particularly when using pseudo labels during prompt optimization. (b) Effect of Prompt Length on OT-VP Performance. The performance of OT-VP exhibits only minor variations across different numbers of prompts, showing a robust improvement.
  • Figure 5: Prediction entropy across TTA Algorithms in Single-Source and Multi-Source settings on PACS. In both settings, OT-VP demonstrates a marked reduction in entropy, outperforming Tent-C and Tent-BN, which target entropy minimization directly.
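
The captions for Figures 2 and 4 describe aligning target representations and pseudo-labels with source representations and labels through an OT distance, with $\lambda$ discouraging cross-class transport. The following is a minimal sketch of that idea on toy data, not the authors' implementation. It assumes a pairwise cost of the form $\lVert z^s - z^t \rVert_2 + \lambda \cdot \mathbb{1}[y^s \neq \hat{y}^t]$, which the text here does not spell out, and it swaps in plain Sinkhorn iterations as the OT solver. The function names (`label_aware_cost`, `sinkhorn_plan`, `cross_class_mass`) and the toy arrays are hypothetical; no ViT or prompt optimization is involved.

```python
import numpy as np


def label_aware_cost(z_s, y_s, z_t, y_t_hat, lam):
    """Pairwise cost: feature distance plus lam where labels disagree (assumed form)."""
    # Euclidean distance between every source/target representation pair.
    feat = np.linalg.norm(z_s[:, None, :] - z_t[None, :, :], axis=-1)
    # 1.0 where the source label differs from the target pseudo-label.
    mismatch = (y_s[:, None] != y_t_hat[None, :]).astype(float)
    return feat + lam * mismatch


def sinkhorn_plan(cost, eps=0.5, n_iters=500):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)  # rescale to match the target marginal
        u = a / (K @ v)    # rescale to match the source marginal
    return u[:, None] * K * v[None, :]


def cross_class_mass(plan, y_s, y_t_hat):
    """Total mass transported between pairs whose labels disagree."""
    return float(plan[y_s[:, None] != y_t_hat[None, :]].sum())


# Toy stand-ins for source representations/labels and shifted target
# representations/pseudo-labels (two classes, two points each).
z_s = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
y_s = np.array([0, 0, 1, 1])
z_t = np.array([[3.0, 0.0], [3.0, 1.0], [7.0, 0.0], [7.0, 1.0]])
y_t_hat = np.array([0, 0, 1, 1])

leak = {}
for lam in (0.0, 5.0):
    cost = label_aware_cost(z_s, y_s, z_t, y_t_hat, lam)
    plan = sinkhorn_plan(cost)
    leak[lam] = cross_class_mass(plan, y_s, y_t_hat)
    ot_distance = float((plan * cost).sum())
    print(f"lambda={lam}: OT distance={ot_distance:.3f}, cross-class mass={leak[lam]:.2e}")
```

In this toy setup, raising `lam` moves transport mass away from label-mismatched pairs, which is the role the Figure 4 caption attributes to $\lambda$. In a test-time setting, a quantity like `ot_distance` would be the objective minimized with respect to the prompt tokens; that optimization loop is outside the scope of this sketch.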