Prompt-based Visual Alignment for Zero-shot Policy Transfer

Haihan Gao; Rui Zhang; Qi Yi; Hantao Yao; Haochen Li; Jiaming Guo; Shaohui Peng; Yunkai Gao; QiCheng Wang; Xing Hu; Yuanbo Wen; Zihao Zhang; Zidong Du; Ling Li; Qi Guo; Yunji Chen

TL;DR

This work tackles RL policy generalization under domain shift by introducing Prompt-based Visual Alignment (PVA), which leverages a Visual-Language Model to impose semantic constraints on cross-domain image representations. The method learns a prompt-based description framework with global, domain-specific, and instance-conditional components to guide a visual aligner that maps multi-domain observations into a unified domain. A three-stage pipeline (Prompt Tuning, Visual Alignment, and Robust Policy Optimization) uses CLIP-based embeddings and multiple losses (global, patch, feature) to achieve strong zero-shot generalization with limited cross-domain data, demonstrated on CARLA driving tasks. The results show that PVA outperforms traditional representation-learning and image-translation baselines, reducing data requirements while improving robustness to unseen domains, which matters for reliable vision-based RL under real-world domain shift.
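The prompt structure described above (global, domain-specific, and instance-conditional components) can be illustrated with a minimal sketch. This is not the paper's implementation: the function `build_prompt`, the linear map `meta_weights`, the token counts, and the embedding width are all hypothetical, and random arrays stand in for learned tokens and image features.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical token width, chosen small for illustration


def build_prompt(global_tokens, domain_tokens, image_feature, meta_weights):
    """Stack global, domain-specific, and instance-conditional tokens.

    global_tokens: (n_g, d) tokens shared by every domain
    domain_tokens: (n_d, d) tokens for the domain of this image
    image_feature: (f,)     feature vector of the current image
    meta_weights:  (f, d)   hypothetical linear map that turns the image
                            feature into one instance-conditional token
    """
    instance_token = image_feature @ meta_weights  # shape (d,)
    return np.vstack([global_tokens, domain_tokens, instance_token[None, :]])


rng = np.random.default_rng(0)
global_tokens = rng.normal(size=(4, EMBED_DIM))
domain_tokens = {
    "ClearNoon": rng.normal(size=(2, EMBED_DIM)),
    "HardRainNoon": rng.normal(size=(2, EMBED_DIM)),
}
meta_weights = rng.normal(size=(16, EMBED_DIM))
image_feature = rng.normal(size=16)

prompt = build_prompt(
    global_tokens, domain_tokens["ClearNoon"], image_feature, meta_weights
)
print(prompt.shape)  # (7, 8): 4 global + 2 domain + 1 instance token
```

In this sketch only the domain tokens change when the domain changes, and only the last token changes from image to image, which is one way to read the three-part split.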

Abstract

Overfitting has become one of the main obstacles to applying reinforcement learning (RL). Existing methods do not provide an explicit semantic constraint for the feature extractor, which hinders the agent from learning a unified cross-domain representation and degrades performance on unseen domains. They also require abundant data from multiple domains. To address these issues, we propose prompt-based visual alignment (PVA), a robust framework that mitigates detrimental domain bias in images for zero-shot policy transfer. Inspired by the observation that a Visual-Language Model (VLM) can bridge text space and image space, we use the semantic information contained in a text sequence as an explicit constraint to train a visual aligner. The visual aligner can thus map images from multiple domains to a unified domain and generalize well. To better depict semantic information, prompt tuning is applied to learn a sequence of learnable tokens. With explicit semantic constraints, PVA learns a unified cross-domain representation under limited access to cross-domain data and achieves strong zero-shot generalization in unseen domains. We verify PVA on a vision-based autonomous driving task in the CARLA simulator. Experiments show that the agent generalizes well to unseen domains under limited access to multi-domain data.
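The idea of using text semantics as an explicit constraint on the visual aligner can be sketched as a cosine-similarity loss between an image embedding and a text embedding in a shared space. This is a generic illustration and not the paper's loss (which combines global, patch, and feature terms): the function names are hypothetical, and hand-written vectors stand in for the outputs of a VLM's image and text encoders.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_alignment_loss(image_embedding, text_embedding):
    """1 - cosine similarity: zero when the two embeddings point the same way."""
    return 1.0 - cosine_similarity(image_embedding, text_embedding)


# Stand-in embeddings (assumption: a real setup would take these from the
# image and text encoders of a vision-language model such as CLIP).
text_target = np.array([1.0, 0.0, 0.0])  # prompt describing the target domain
aligned = np.array([2.0, 0.0, 0.0])      # aligner output that matches the prompt
unaligned = np.array([0.0, 3.0, 0.0])    # aligner output orthogonal to the prompt

print(semantic_alignment_loss(aligned, text_target))    # 0.0
print(semantic_alignment_loss(unaligned, text_target))  # 1.0
```

Minimizing a loss of this shape with respect to the aligner's parameters pushes the aligned image toward the semantics of the prompt, regardless of which source domain the image came from.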

Paper Structure (22 sections, 11 equations, 8 figures, 7 tables)

Introduction
Related Work
Preliminary
Prompt-based Visual Aligner
Prompt Tuning
Prompt Design
Tuning the learnable parts of prompt
Visual Alignment
Robust Policy Training
Experiment
Setup
Design of the model and details of training
Experiment Result
Comparison with existing methods
Comparison with other baselines with the same number of images in training
...and 7 more sections

Figures (8)

Figure 1: Input embedding distributions for the policy before (a) and after (b) domain alignment. Different colors represent different domains; ClearNoon and HardRainNoon form the training set. (a) Latent features generated by LUSR show severe out-of-distribution behavior between seen and unseen domains (such as ClearSunset). (b) Our approach mitigates the domain bias across domains and aligns the latent distributions of the training and testing domains well.
Figure 2: Overview of the Prompt-based Visual Aligner (PVA). The method has two key components. The prompt learner $f_{PL}$ obtains a learnable prompt from the input image. The visual aligner $g_\theta$ transfers the image from one domain to another via the semantic information contained in the learnable prompt. The agent then uses the transferred image to train a robust policy. In the illustration, one icon marks networks that are frozen and another marks networks that are updated during training.
Figure 3: Illustration of different domains. ClearNoon and HardRainNoon are used to tune the prompt and train the visual aligner. We validate the agent's performance on WetCloudySunset, ClearSunset, and SoftRainSunset, which do not appear in the training stage.
Figure 4: Visualization of the distributions of different domains' latent embeddings, which serve as the input of the RL agent. Different colors represent different domains. Our approach alleviates the bias between domains, which lets the agent generalize despite detrimental differences in observations.
Figure 5: Comparison of the visual transfer results of our method with other image-to-image approaches. The images generated by our method are shown in the leftmost sub-figure. The road lines, which are essential for the control task, remain identifiable in the transferred images.
...and 3 more figures
