Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation

Senwei Xie, Hongyu Wang, Zhanqi Xiao, Ruiping Wang, Xilin Chen

TL;DR

RoboPro presents a robotic foundation model that translates visual observations and free-form instructions into executable policy code, enabling zero-shot manipulation across diverse robots and environments. It introduces Video2Code, a scalable data-curation pipeline that converts in-the-wild videos into 115k robot-runtime code examples, facilitating training without hand-authored data. Through a vision-encoder–code-LLM architecture and three-stage training, RoboPro achieves state-of-the-art zero-shot performance on RLBench and LIBERO, outperforming GPT-4o and approaching supervised baselines while robustly generalizing to API format changes and unseen skills. The work demonstrates that incorporating procedural knowledge from instructional videos into training significantly enhances visual reasoning and policy execution in robotics, with strong implications for scalable, real-world deployment.

Abstract

Zero-shot generalization across various robots, tasks, and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code, which synthesizes executable code from extensive in-the-wild videos using an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses that of the state-of-the-art model GPT-4o by 11.6%, making it comparable even to a strong supervised training baseline. Furthermore, RoboPro is robust to variations in API formats and skill sets.
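
To make the idea of policy code concrete, the following is a minimal sketch of how a short program over an atomic skill library can bridge an instruction and a low-level action sequence. Everything here is an assumption for illustration: `SkillLibrary`, its method names, and the object poses are hypothetical and are not RoboPro's actual API; a real library would wrap perception and motion planning rather than a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Pose = Tuple[float, float, float]  # (x, y, z) position; orientation omitted for brevity


@dataclass
class SkillLibrary:
    """Hypothetical atomic-skill API that records the low-level actions it would send."""
    object_poses: Dict[str, Pose]
    actions: List[Tuple[str, Optional[Pose]]] = field(default_factory=list)

    def get_obj_pose(self, name: str) -> Pose:
        # Stand-in for a perception call; here it is a dictionary lookup.
        return self.object_poses[name]

    def move_to(self, pose: Pose) -> None:
        self.actions.append(("move", pose))

    def close_gripper(self) -> None:
        self.actions.append(("close", None))

    def open_gripper(self) -> None:
        self.actions.append(("open", None))


def put_block_in_bowl(api: SkillLibrary) -> None:
    """Policy code of the kind a model might emit for 'put the block in the bowl'."""
    api.move_to(api.get_obj_pose("block"))
    api.close_gripper()
    api.move_to(api.get_obj_pose("bowl"))
    api.open_gripper()


api = SkillLibrary({"block": (0.1, 0.2, 0.0), "bowl": (0.4, 0.0, 0.05)})
put_block_in_bowl(api)
for action in api.actions:
    print(action)
```

Because the policy only calls the library's methods, swapping the library implementation (a simulator, a different robot) leaves the generated program unchanged, which is the property that makes this style attractive for zero-shot transfer.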

Paper Structure

This paper contains 35 sections, 4 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Visualization of evaluation tasks and execution results. RoboPro shows impressive zero-shot performance on novel and compositional tasks in RLBench (a), long-horizon manipulation tasks in LIBERO (b), and real-world tasks (c). Video demos can be found in our supplementary materials.
  • Figure 2: The data curation pipeline of Video2Code. We first use the Draft VLM to extract a brief natural language plan for execution of the user instruction. After that, the Code LLM generates robot-centric code using the provided API library and natural language plan from the first stage.
  • Figure 3: The overview of RoboPro. RoboPro takes an environmental observation and a natural language instruction as multimodal input and outputs executable policy code. An extensible API library maps the policy code to low-level execution sequences.
  • Figure 4: Error breakdown on RLBench.
  • Figure 5: Success rate on manipulation tasks across varying data proportions.
  • ...and 3 more figures
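
The two-stage flow described in the Figure 2 caption (a Draft VLM produces a brief plan, then a Code LLM turns the plan and the API library into code) can be sketched as below. This is an illustrative sketch only: `video2code`, the prompt wording, and both stub models are hypothetical stand-ins, not the authors' implementation or prompts, and the stubs return canned text so the example runs without any model.

```python
from typing import Callable, Sequence


def video2code(
    frames: Sequence[object],
    instruction: str,
    api_doc: str,
    draft_vlm: Callable[[Sequence[object], str], str],
    code_llm: Callable[[str], str],
) -> str:
    """Two-stage curation: video + instruction -> plan -> robot-centric code."""
    # Stage 1: the draft VLM summarizes the video into a brief natural language plan.
    plan = draft_vlm(frames, instruction)
    # Stage 2: the code LLM sees the API library and the plan, and writes code.
    prompt = (
        f"API library:\n{api_doc}\n\n"
        f"Instruction: {instruction}\n"
        f"Plan:\n{plan}\n\n"
        "Write robot-centric code using only the API library."
    )
    return code_llm(prompt)


def stub_vlm(frames: Sequence[object], instruction: str) -> str:
    # Hypothetical stand-in for a vision-language model.
    return "1. pick up the cup\n2. place it on the shelf"


def stub_llm(prompt: str) -> str:
    # Hypothetical stand-in for a code LLM: echoes numbered plan steps as comments.
    steps = [line for line in prompt.splitlines() if line[:1].isdigit()]
    return "\n".join(f"# {step}" for step in steps)


code = video2code(
    frames=[],
    instruction="put the cup on the shelf",
    api_doc="grasp(obj)\nplace(obj, target)",
    draft_vlm=stub_vlm,
    code_llm=stub_llm,
)
print(code)
```

Passing the models in as callables keeps the pipeline independent of any particular VLM or LLM, so either stage can be replaced without touching the other.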