A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

Shivansh Patel; Xinchen Yin; Wenlong Huang; Shubham Garg; Hooshang Nayyeri; Li Fei-Fei; Svetlana Lazebnik; Yunzhu Li

TL;DR

The paper tackles the challenge of flexible task specification for open-world robotic manipulation by introducing Iterative Keypoint Reward (IKER), a VLM-generated, visually grounded reward function that operates on 3D keypoints to enable precise SE(3) control. IKER is integrated into a real-to-sim-to-real loop: real scenes are reconstructed into simulation using BundleSDF meshes and FoundationPose, policies are trained with domain randomization in simulation, and the learned policies are deployed back in the real world with IK-based control and vision-based pose estimation. The authors demonstrate IKER's effectiveness across diverse tasks, showing improved multi-step task execution, error recovery, and on-the-fly replanning compared to baselines like VoxPoser and pose-based rewards. Domain randomization is shown to enhance real-world robustness, while limitations include mesh capture requirements, simplified dynamics, and limited multi-object interactions. Overall, IKER offers a scalable, adaptable framework for goal-conditioned robotic manipulation guided by vision-language feedback.

Abstract

Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
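To make the idea of a Python reward function over keypoints concrete, the following is a minimal sketch of what such a function might look like. It is not the authors' generated code: the function names `target_keypoints` and `keypoint_reward`, the keypoint indices, the 5 cm clearance, and the mean-distance reward form are all assumptions chosen for illustration. Assigning goals to two keypoints on the same rigid object constrains its position and the direction of one of its axes, which hints at how keypoint relationships can express pose-level goals.

```python
import numpy as np


def target_keypoints(keypoints):
    """Hypothetical stand-in for the VLM-written step: map keypoints to goals.

    Assumed layout (illustrative only): keypoints 0 and 1 lie on the object,
    keypoint 2 lies on the support surface the object should be placed on.
    """
    kp = np.asarray(keypoints, dtype=float)
    clearance = np.array([0.0, 0.0, 0.05])    # assumed 5 cm above the surface
    length = np.linalg.norm(kp[1] - kp[0])    # keep the object's own extent
    goal0 = kp[2] + clearance
    goal1 = goal0 + np.array([length, 0.0, 0.0])  # align the object with +x
    return {0: goal0, 1: goal1}


def keypoint_reward(current, targets):
    """Negative mean distance between tracked keypoints and their goals."""
    cur = np.asarray(current, dtype=float)
    dists = [np.linalg.norm(cur[i] - goal) for i, goal in targets.items()]
    return -float(np.mean(dists))


# Usage: compute goals once from the initial observation, then score states.
initial = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.5, 0.5, 0.0]])
targets = target_keypoints(initial)

solved = initial.copy()
solved[0], solved[1] = targets[0], targets[1]

r_start = keypoint_reward(initial, targets)
r_goal = keypoint_reward(solved, targets)
```

In this sketch the reward rises toward zero as the object keypoints approach their goals, so an RL policy trained in simulation would be driven toward the specified placement. In an iterative setting, `target_keypoints` would be regenerated from a fresh observation after each step.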
