Affordance-Guided Reinforcement Learning via Visual Prompting

Olivia Y. Lee, Annie Xie, Kuan Fang, Karl Pertsch, Chelsea Finn

TL;DR

KAGI introduces dense, affordance-guided rewards derived from vision-language models to augment sparse task rewards, enabling more data-efficient autonomous RL for real-world robotic manipulation. By extracting zero-shot keypoints and waypoint trajectories from VLMs and fusing them with RoboFuME's pretraining, the method provides dense shaping rewards that guide online fine-tuning. Empirical results in simulation and on real robots show improved success rates and robustness, including resilience to reductions in in-domain demonstrations. The work demonstrates the potential of open-vocabulary visual prompting to enhance the efficiency and generalization of autonomous robotic learning, with future directions in object-centric tracking and offline RL improvements.

Abstract

Robots equipped with reinforcement learning (RL) have the potential to learn a wide range of skills solely from a reward signal. However, obtaining a robust and dense reward signal for general manipulation tasks remains a challenge. Existing learning-based approaches require significant data, such as human demonstrations of success and failure, to learn task-specific reward functions. Recently, there has also been growing adoption of large multi-modal foundation models for robotics that can perform visual reasoning in physical contexts and generate coarse robot motions for manipulation tasks. Motivated by this range of capabilities, we present Keypoint-based Affordance Guidance for Improvements (KAGI), a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL. State-of-the-art VLMs have demonstrated impressive zero-shot reasoning about affordances through keypoints, and we use these to define dense rewards that guide autonomous robotic learning. On diverse real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion in 30K online fine-tuning steps. Additionally, we demonstrate the robustness of KAGI to reductions in the number of in-domain demonstrations used for pre-training, reaching similar performance in 45K online fine-tuning steps. Project website: https://sites.google.com/view/affordance-guided-rl

Paper Structure

This paper contains 15 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Keypoint-based Affordance Guidance for Improvements (KAGI) computes dense rewards defined using affordance-based keypoints and waypoint trajectories inferred by a VLM. Our dense reward formulation helps to shape learned behaviors, facilitating efficient online fine-tuning across diverse real-world tasks.
  • Figure 2: KAGI consists of two components. The first, represented by the arrow labeled 1 above, leverages a VLM to select from a set of affordance keypoints and then generate a waypoint sequence. The second, represented by the arrows labeled 2 above, computes a per-timestep reward for each frame in the episode replay buffer: a dense reward with respect to the waypoint sequence and a sparse reward derived from a success classifier. The dense reward is used for online RL if the sparse reward is 0; otherwise the sparse reward is used.
  • Figure 3: Examples of annotated images given to the VLM (top-down and side views) and a VLM-generated trajectory (closest-block example). In the top-down view, teal points labeled P1-5 denote grasp keypoints and blue points labeled Q1-5 denote target keypoints for the VLM to select from. In the trajectory example, the orange point is the robot position, green tiles denote the generated trajectory, and arrows denote the motion direction. KAGI's dense reward formulation encourages the robot to move to the next block, following the green arrow.
  • Figure 4: Tasks Visualization. Evaluation is conducted on four real-world tasks: Cloth Covering (Deformables Manipulation), Almond Sweeping (Non-Prehensile Manipulation), Spatula Pick-Place (Functional Grasping), and Cube Stacking (Precise Manipulation).
  • Figure 5: Average success across $3$ seeds on simulated Bin-Sorting. We evaluate each reward formulation under: the standard number of demos (left), $2\times$ reduction (middle), and $5\times$ reduction (right).
  • ...and 1 more figure
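
The reward selection rule in the Figure 2 caption, together with the "move to the next block" behavior in the Figure 3 caption, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the 2D point representation, and the use of negative Euclidean distance to the waypoint following the closest one are all assumptions made for illustration; only the rule "use the dense reward when the sparse reward is 0, else use the sparse reward" comes from the captions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dense_waypoint_reward(position: Point, waypoints: Sequence[Point]) -> float:
    """Hypothetical dense reward: negative distance to the waypoint that
    follows the one closest to the current position.

    The exact functional form is an assumption, not taken from the paper.
    """
    if not waypoints:
        raise ValueError("waypoints must not be empty")
    # Index of the waypoint nearest to the current position.
    closest = min(range(len(waypoints)),
                  key=lambda i: math.dist(position, waypoints[i]))
    # Target the next waypoint, or the final one if already at the end.
    target = waypoints[min(closest + 1, len(waypoints) - 1)]
    return -math.dist(position, target)


def combined_reward(sparse_reward: float, dense_reward: float) -> float:
    """Selection rule from the Figure 2 caption: the dense reward is used
    only when the sparse (success classifier) reward is 0."""
    return dense_reward if sparse_reward == 0 else sparse_reward


if __name__ == "__main__":
    # Hypothetical waypoint sequence along a straight line.
    path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    dense = dense_waypoint_reward((0.9, 0.0), path)
    print(combined_reward(0.0, dense))  # no success detected: dense reward
    print(combined_reward(1.0, dense))  # success detected: sparse reward
```

In this sketch, the dense term only shapes behavior on frames where the success classifier reports no success, so a detected success always takes precedence over the shaping signal.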