Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents

Wonje Choi, Woo Kyung Kim, SeungHyun Kim, Honguk Woo

TL;DR

This work tackles zero-shot policy adaptation for embodied agents facing diverse and non-stationary visual domain changes. It introduces ConPE, a framework that combines prompt-based contrastive learning with a guided-attention-based ensemble over a pool of visual prompts derived from the CLIP vision-language model, producing a robust state representation $Z = z_0 + \sum_i \omega_i z_i$ (with attention weight $\omega_i$ applied to each prompted embedding $z_i$) that can generalize across domain factors. The approach jointly trains the attention module and the policy to maximize task performance across seen and unseen domains, achieving superior zero-shot performance and improved sample efficiency on AI2THOR, egocentric-Metaworld, and CARLA. By enabling rapid adaptation with a compact prompt ensemble and interpretable attention weights, ConPE offers a practical pathway for robust embodied RL in visually varied environments, and its modular design invites extensions with semantic and multi-modal cues. Overall, ConPE demonstrates that structured visual prompting and attention-guided fusion can substantially enhance zero-shot transfer in embodied agents while maintaining data efficiency.

Abstract

For embodied reinforcement learning (RL) agents interacting with the environment, it is desirable to have rapid policy adaptation to unseen visual observations, but achieving zero-shot adaptation capability is considered a challenging problem in the RL context. To address the problem, we present a novel contrastive prompt ensemble (ConPE) framework which utilizes a pretrained vision-language model and a set of visual prompts, thus enabling efficient policy learning and adaptation to a wide range of environmental and physical changes encountered by embodied agents. Specifically, we devise a guided-attention-based ensemble approach with multiple visual prompts on the vision-language model to construct robust state representations. Each prompt is contrastively learned in terms of an individual domain factor that significantly affects the agent's egocentric perception and observation. For a given task, the attention-based ensemble and policy are jointly learned so that the resulting state representations not only generalize to various domains but are also optimized for learning the task. Through experiments, we show that ConPE outperforms other state-of-the-art algorithms for several embodied agent tasks including navigation in AI2THOR, manipulation in egocentric-Metaworld, and autonomous driving in CARLA, while also improving the sample efficiency of policy learning and adaptation.
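
The ensemble step behind the state representation $Z = z_0 + \sum_i \omega_i z_i$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that $z_0$ is a base image embedding and that the weights $\omega_i$ come from a softmax over the cosine similarity between $z_0$ and each prompted embedding $z_i$. In ConPE the attention module is learned jointly with the policy, so the fixed softmax here is only a stand-in. The function names, the `temperature` parameter, and the random embeddings are all hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between a vector a (d,) and each row of b (n, d)."""
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_unit @ a_unit


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def ensemble_state(z0, prompted, temperature=1.0):
    """Combine a base embedding with prompted embeddings.

    z0:       (d,)   base embedding (assumed: embedding without a visual prompt)
    prompted: (n, d) one embedding per visual prompt
    Returns Z = z0 + sum_i w_i * prompted[i] and the weights w.
    The weights are a fixed softmax over cosine similarities, which is an
    assumption standing in for the learned attention module.
    """
    sims = cosine_similarity(z0, prompted)
    weights = softmax(sims / temperature)
    z = z0 + weights @ prompted
    return z, weights


# Hypothetical data: 3 prompted embeddings of dimension 8.
rng = np.random.default_rng(0)
z0 = rng.normal(size=8)
prompted = rng.normal(size=(3, 8))
Z, w = ensemble_state(z0, prompted)
```

Because the weights sum to one, they can be read as the relative contribution of each visual prompt, which is the quantity shown per cell in the attention-weight figure. A policy network would then take `Z` as its state input.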

Paper Structure

This paper contains 30 sections, 9 equations, 12 figures, 19 tables, 3 algorithms.

Figures (12)

  • Figure 1: Visual Domain Changes of Embodied Agents
  • Figure 2: ConPE Framework. The CLIP visual encoder is enhanced offline via (i) prompt-based contrastive learning that generates the visual prompt pool, and a policy is learned online by (ii) guided-attention-based prompt ensemble that uses the prompt pool. In (iii) zero-shot deployment, the policy is immediately evaluated upon domain changes.
  • Figure 3: Guided-Attention-based Prompt Ensemble. The cosine similarity-guided attention module $\mathcal{G}$ yields task-specific state representations from multiple prompted embeddings and is learned with a policy network $\pi$.
  • Figure 4: Sample-efficiency of Prompt Ensemble-based Policy Learning for Object Navigation in AI2THOR. The x-axis represents the number of samples (timesteps) used for policy learning, while the y-axis represents the task success rate for zero-shot evaluation.
  • Figure 5: Prompt Ensemble Interpretability. In (a), the embeddings in the big circle are intra prompted embeddings obtained by varying domains within a domain factor, and the embeddings in the rectangle are inter prompted embeddings obtained by changing the visual prompts with aligned observation. The closely located intra prompted embeddings indicate the domain-invariant knowledge, while the inter prompted embeddings clustered by different visual prompts indicate the alignment between the visual prompts and the domain factors. In (b), each cell represents attention weight $\omega_i$ applied for prompted embedding $z_i$.
  • ...and 7 more figures