CDE: Concept-Driven Exploration for Reinforcement Learning

Le Mao, Andrew H. Liu, Renos Zabounidis, Zachary Kingston, Joseph Campbell

TL;DR

CDE addresses visual RL exploration under sparse rewards by using a pre-trained vision-language model (VLM) to generate object-centric concepts from task descriptions. The policy learns to reconstruct these concepts via a concept-embedding framework, and the reconstruction accuracy serves as an intrinsic reward that drives targeted exploration. The VLM is required only during training. The approach yields robust, object-centric exploration across five simulated visual manipulation tasks and transfers from simulation to a real Franka arm, reaching an 80% success rate on a real-world manipulation task. By maintaining dual object representations for visible and non-visible states, CDE remains effective with wrist-mounted cameras and is resilient to noisy VLM outputs, offering a practical path to deployment without online VLM dependence.

Abstract

Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Because the policy internalizes these concepts, VLM queries are only needed during training, reducing dependence on external models during deployment. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka Research 3 arm, attaining an 80% success rate in a real-world manipulation task.
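
The idea of using reconstruction accuracy as an intrinsic reward can be illustrated with a minimal sketch. The abstract does not state which accuracy measure the authors use, so the code below assumes intersection-over-union (IoU) between the policy's reconstructed mask and the VLM mask. The names `mask_iou`, `total_reward`, the threshold, and the weight `beta` are hypothetical and not taken from the paper.

```python
from typing import Sequence

Mask = Sequence[Sequence[float]]  # 2D grid of values in [0, 1]


def mask_iou(pred: Mask, target: Mask, threshold: float = 0.5) -> float:
    """IoU between a thresholded predicted mask and a binary target mask.

    Assumption: IoU stands in for "reconstruction accuracy"; the paper may
    define the measure differently.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on = p >= threshold
            t_on = t >= 0.5
            intersection += int(p_on and t_on)
            union += int(p_on or t_on)
    # Assumption: if neither mask contains the object, give no bonus,
    # so the agent is not rewarded for looking away from the target.
    return 0.0 if union == 0 else intersection / union


def total_reward(extrinsic: float, pred: Mask, target: Mask,
                 beta: float = 0.1) -> float:
    """Sparse task reward plus a weighted intrinsic bonus (beta is hypothetical)."""
    return extrinsic + beta * mask_iou(pred, target)


if __name__ == "__main__":
    pred = [[0.9, 0.2], [0.8, 0.1]]   # reconstructed mask probabilities
    target = [[1, 0], [0, 0]]         # VLM-provided mask
    print(mask_iou(pred, target))
    print(total_reward(0.0, pred, target))
```

In this sketch the bonus is largest when the target object is in view and the policy reconstructs its mask well, which is one way the reward could steer exploration toward task-relevant objects. Because the target mask is needed only to compute the training reward, nothing here has to run at deployment time.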

Paper Structure

This paper contains 21 sections, 5 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Concept-Driven Exploration overview. Task-relevant objects are identified from a task description and then used by a VLM to generate segmentation masks for each object. These segmentation masks shape policy representation learning and guide exploration.
  • Figure 2: Architecture. (a) The LLM parses the task description to extract the target object, and the VLM produces segmentation masks from the input RGB images. (b) During training, the policy takes segmentation masks as additional input and generates an intrinsic reward signal for the preceding timestep. (c) At each timestep $t+1$, the policy network receives the environment observation $O_{t+1}$ and encodes it into a positive embedding $\hat{\textbf{c}}^{+}(O_{t+1})$ and a negative embedding $\hat{\textbf{c}}^{-}(O_{t+1})$; the final concept embedding is a weighted sum of the two. (d) The segmentation masks $M_{t+1}$ supervise mask reconstruction from the positive embedding $\hat{\textbf{c}}^{+}(O_{t+1})$ and generate the intrinsic reward signal $R_{t}^{\text{int}}$.
  • Figure 3: Environment setup. The agent is required to interact with the outlined target object to accomplish the task.
  • Figure 4: Example masks under each noise setting. Both images and masks have a resolution of 84 $\times$ 84. For VLM masks, we segment 320 $\times$ 320 RGB images and downsample the resulting masks to 84 $\times$ 84.
  • Figure 5: Simulated task results. All tasks are run across 10 random seeds and we report average success rate with standard error. (Top row) Learning with ground truth masks. (Bottom row) Learning with masks with synthetic noise. CDE shows better stability and robustness to noisy mask inputs than baselines.
  • ...and 4 more figures
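
Two steps mentioned in the captions above can be sketched in a few lines: downsampling a VLM mask from 320 × 320 to 84 × 84 (Figure 4) and forming the final concept embedding as a weighted sum of the positive and negative embeddings (Figure 2c). The captions do not specify the interpolation method or how the weight is obtained, so this sketch assumes nearest-neighbour sampling and a scalar weight `w` in [0, 1] supplied by the caller. The function names are hypothetical.

```python
from typing import List, Sequence


def downsample_nearest(mask: Sequence[Sequence[int]],
                       out_h: int, out_w: int) -> List[List[int]]:
    """Nearest-neighbour downsampling of a 2D mask (assumed method)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def mix_embeddings(c_pos: Sequence[float], c_neg: Sequence[float],
                   w: float) -> List[float]:
    """Weighted sum of positive and negative concept embeddings.

    Assumption: a single scalar weight w in [0, 1]; the paper may learn or
    predict this weight in a different way.
    """
    if len(c_pos) != len(c_neg):
        raise ValueError("embeddings must have the same length")
    return [w * p + (1.0 - w) * n for p, n in zip(c_pos, c_neg)]


if __name__ == "__main__":
    # A 320 x 320 mask with the object in the top-left quadrant.
    big = [[1 if r < 160 and c < 160 else 0 for c in range(320)]
           for r in range(320)]
    small = downsample_nearest(big, 84, 84)
    print(len(small), len(small[0]))

    # Mostly "object visible": the result leans toward the positive embedding.
    print(mix_embeddings([1.0, 0.0], [0.0, 1.0], w=0.75))
```

Nearest-neighbour sampling keeps the mask binary, which is convenient when the mask serves as a supervision target. A smoothing interpolation would instead produce fractional values along object boundaries.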