A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

Shivansh Patel, Xinchen Yin, Wenlong Huang, Shubham Garg, Hooshang Nayyeri, Li Fei-Fei, Svetlana Lazebnik, Yunzhu Li

TL;DR

The paper tackles the challenge of flexible task specification for open-world robotic manipulation by introducing Iterative Keypoint Reward (IKER), a VLM-generated, visually grounded reward function that operates on 3D keypoints to enable precise SE(3) control. IKER is integrated into a real-to-sim-to-real loop: real scenes are reconstructed into simulation using BundleSDF meshes and FoundationPose, policies are trained with domain randomization in simulation, and the learned policies are deployed back in the real world with IK-based control and vision-based pose estimation. The authors demonstrate IKER's effectiveness across diverse tasks, showing improved multi-step task execution, error recovery, and on-the-fly replanning compared to baselines like VoxPoser and pose-based rewards. Domain randomization is shown to enhance real-world robustness, while limitations include mesh capture requirements, simplified dynamics, and limited multi-object interactions. Overall, IKER offers a scalable, adaptable framework for goal-conditioned robotic manipulation guided by vision-language feedback.

Abstract

Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed in the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
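To make the idea of a Python reward function over keypoints concrete, here is a minimal sketch. It is not the code the VLM generates in the paper: the function name `keypoint_reward`, the target dictionary, and the 2 cm tolerance are all assumptions chosen for illustration. The point it shows is that placing targets on several keypoints of the same rigid object constrains its orientation as well as its position.

```python
from typing import Dict, Sequence, Tuple

import numpy as np


def keypoint_reward(
    keypoints: np.ndarray,
    targets: Dict[int, Sequence[float]],
    tol: float = 0.02,
) -> Tuple[float, bool]:
    """Hypothetical keypoint reward (illustrative, not the paper's code).

    keypoints: (N, 3) array of current 3D keypoint positions in meters.
    targets:   maps a keypoint index to its desired 3D position.
    tol:       assumed success threshold on the mean distance, in meters.
    Returns (reward, success), where reward is the negative mean distance.
    """
    dists = [
        np.linalg.norm(keypoints[idx] - np.asarray(goal, dtype=float))
        for idx, goal in targets.items()
    ]
    mean_dist = float(np.mean(dists))
    return -mean_dist, mean_dist < tol


# Example: keypoint 0 is a shoe's toe, keypoint 1 its heel (assumed labels).
# Giving both a target fixes where the shoe sits and which way it points.
current = np.array([[0.60, 0.00, 0.10], [0.40, 0.00, 0.10]])
goal = {0: [0.50, 0.00, 0.10], 1: [0.30, 0.00, 0.10]}
reward, success = keypoint_reward(current, goal)
```

A dense negative-distance term like this is only one plausible shaping choice; a generated reward could equally combine relative offsets between keypoints on different objects.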

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Capabilities of Our Framework. IKER is designed to handle a wide range of real-world tasks. It can be seamlessly chained to execute multi-step tasks. It exhibits robustness to disturbances and demonstrates the ability to solve problems flexibly.
  • Figure 2: Framework Overview. Iterative Keypoint Reward (IKER) is a visually grounded reward generated by Vision-Language Models (VLMs) as task specification. The framework reconstructs the real-world scene in simulation, and the generated reward is used to train RL policies, which are subsequently deployed in the real world.
  • Figure 3: Iterative Keypoint Reward Generation. This corresponds to the first step in Figure 2. We first obtain keypoints in the scene. These keypoints, combined with a human command and execution history, are processed by a VLM to generate code that maps keypoints to the reward function. A more detailed illustration of the keypoints and generated code is provided in Figure \ref{fig:unrolled}.
  • Figure 4: Setup and experiment objects. We use an XArm7 robot to conduct all our experiments. Our setup includes 4 stationary cameras and 1 wrist-mounted camera. We experiment with 5 shoe pairs and 2 shoe racks for tasks involving shoe scenarios. Additionally, we experiment with 9 different books for stowing tasks.
  • Figure 5: Scenarios demonstrating capabilities of our framework. The framework is robust to disturbances and can adapt in response to unexpected events. Additionally, it can propose new plans when the original ones become infeasible.
  • ...and 3 more figures
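
The reward-generation step in the Figure 3 caption (keypoints, a human command, and execution history go to a VLM, which returns reward code) can be sketched as a simple loop. Everything named here is hypothetical: `build_prompt`, `run_iterative_rewards`, the injected callables, and the `DONE` stop signal are assumptions for illustration. Keypoints are serialized as text only to keep the sketch self-contained; the actual system gives the VLM visual observations.

```python
from typing import Callable, Dict, List, Sequence


def build_prompt(
    command: str,
    keypoints: Dict[int, Sequence[float]],
    history: List[str],
) -> str:
    """Assemble a text prompt from the command, keypoints, and history."""
    lines = [f"Instruction: {command}", "Keypoints (index: x, y, z in meters):"]
    for idx, (x, y, z) in sorted(keypoints.items()):
        lines.append(f"  {idx}: {x:.3f}, {y:.3f}, {z:.3f}")
    lines.append("Previously executed reward functions:")
    lines.extend(history if history else ["  (none)"])
    lines.append("Write the next Python reward function, or reply DONE.")
    return "\n".join(lines)


def run_iterative_rewards(
    command: str,
    get_keypoints: Callable[[], Dict[int, Sequence[float]]],
    query_vlm: Callable[[str], str],
    train_and_execute: Callable[[str], None],
    max_steps: int = 5,
) -> List[str]:
    """Hypothetical outer loop: re-observe, re-prompt, execute, repeat.

    query_vlm and train_and_execute are placeholders for the VLM call and
    for the simulate-train-deploy stage; neither is a real library API.
    """
    history: List[str] = []
    for _ in range(max_steps):
        # Keypoints are re-sampled every step so the next reward reflects
        # the current scene, including any disturbance since the last step.
        prompt = build_prompt(command, get_keypoints(), history)
        reward_code = query_vlm(prompt)
        if reward_code.strip() == "DONE":  # assumed completion signal
            break
        train_and_execute(reward_code)
        history.append(reward_code)
    return history
```

Because the scene is re-observed before each query, a failed or disturbed step simply shows up in the next prompt, which is one way to read the error-recovery behavior the captions describe.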