RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

TL;DR

The paper addresses the challenge of turning high-level language instructions into deployable robot manipulation code that works on both real hardware and simulation. It introduces RoboScript, a ROS-based pipeline that integrates perception, planning, and control with LLM-driven code generation, validated through a Gazebo/MoveIt-based benchmark and live deployments on Franka and UR5 arms. The RoboScript Benchmark evaluates LLMs on physical-space reasoning, perception reliability, and sim-to-real transfer, with ablations showing perception quality critically impacts planning and execution. Results indicate GPT-4 outperforms GPT-3.5 and Gemini in code correctness, while object geometry meaningfully influences grasping success, highlighting practical challenges in open-world manipulation. Overall, RoboScript demonstrates end-to-end AI-powered robotic programming from natural language to executable control, offering a platform for rapid prototyping and real-world deployment with clearer paths for future reliability and scalability.

Abstract

Rapid progress in high-level task planning and code generation for open-world robot manipulation has been witnessed in Embodied AI. However, previous studies put much effort into the general common-sense reasoning and task planning capabilities of large-scale language or multi-modal models, and relatively little into ensuring the deployability of generated code on real robots or into other fundamental components of autonomous robot systems, including robot perception, motion planning, and control. To bridge this "ideal-to-real" gap, this paper presents RoboScript, a platform comprising 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language. The RoboScript platform addresses this gap by emphasizing a unified interface with both simulation and real robots, based on abstraction from the Robot Operating System (ROS), ensuring syntax compliance and simulation validation with Gazebo. We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms, and multiple grippers. Additionally, our benchmark assesses reasoning abilities for physical space and constraints, highlighting the differences between GPT-3.5, GPT-4, and Gemini in handling complex physical interactions. Finally, we present a thorough evaluation of the whole system, exploring how each module in the pipeline (code generation, perception, motion planning, and even object geometric properties) impacts the overall performance of the system.

Paper Structure (34 sections, 3 equations, 15 figures, 4 tables, 1 algorithm)

Figures (15)

  • Figure 1: Framework of RoboScript. The input layer contains sensor input, human instruction, and robotic URDF (Unified Robot Description Format) data. The system utilizes various perception tools, such as grasp detection, 2D grounding, 3D perception, and joint prediction, to interpret the input data. These tools are integrated with motion planning tools that include arm planning, gripper force control, and solving place pose with inverse kinematics (IK). The Robot Operating System (ROS) serves as the middleware to provide abstraction to sensor drivers, controllers, and robot definitions across real robots and the simulation. The framework controls multiple real robots and their counterparts in the simulation with a unified code generation pipeline. This modular approach enables flexibility in robotic applications and adaptability to new code generation methods or robot architectures, from simple tasks to complex manipulations.
  • Figure 2: Pipeline of RoboScript. RoboScript uses Perception and Motion Planning Tools, activated by a task query, to generate a Python script with the LLM. This script includes comments, processes images into 3D models, and plans safe robot movements. Perception Tools identify objects and spatial details for planning. Motion Planning Tools then create a collision-free path for actions.
  • Figure 3: Cross-view bounding box matching score computation. The matching score between two bounding boxes is the Intersection over Union (IoU) of their point-presence binary vectors.
  • Figure 4: Benchmark tabletop environment and tasks.
  • Figure 5: MoveIt planning scene. There are two modes for running the MoveIt planning scene in the benchmark. MoveIt can either load ground truth mesh data from Gazebo, or build an Octomap from sensors in real time and load reconstructed meshes from the perception pipeline.
  • ...and 10 more figures
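
The matching score described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that each bounding box is summarized as a binary vector over a shared set of 3D points, with an entry set to true when that point falls inside the box. The function names, the point indexing, and the zero score for two empty boxes are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterable, List, Sequence


def presence_vector(point_ids_in_box: Iterable[int], num_points: int) -> List[bool]:
    """Build a binary vector marking which of the shared points lie in a box.

    Hypothetical helper: assumes points are indexed 0..num_points-1 and that
    the caller already knows which point indices fall inside the box.
    """
    inside = set(point_ids_in_box)
    return [i in inside for i in range(num_points)]


def matching_score(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Intersection over Union (IoU) of two point-presence binary vectors."""
    if len(a) != len(b):
        raise ValueError("presence vectors must cover the same set of points")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two boxes that contain no points get a score of 0.
    return intersection / union if union else 0.0


# Toy example: 8 shared points, one box from each of two camera views.
vec_a = presence_vector([0, 1, 2, 3], num_points=8)  # box seen in view 1
vec_b = presence_vector([2, 3, 4], num_points=8)     # box seen in view 2
score = matching_score(vec_a, vec_b)  # 2 shared points / 5 points in the union
```

Under this reading, boxes from different views that enclose mostly the same points score close to 1, and boxes that share no points score 0.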