RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, Ping Luo

TL;DR

RoboTwin tackles the scarcity of diverse, real-world demonstrations for dual-arm manipulation with a generative digital twin pipeline that converts 2D images into rich 3D assets with spatial annotations and uses LLM-driven task decomposition. It combines synthetic data with real-world teleoperation data to create a standard benchmark with real-world-aligned evaluation, enabling scalable policy learning. The method relies on 3D generative foundation models and LLM-based code generation to produce diverse tasks and executable trajectories via spatial constraints, validated on the COBOT Magic Platform and ManiSkill3. Experiments show that pretraining on RoboTwin-generated data and fine-tuning on limited real data yields substantial improvements over real-data-only training, indicating effective sim-to-real transfer while highlighting remaining gaps in complex dual-arm coordination.

Abstract

In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.

Paper Structure

This paper contains 27 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: RoboTwin Benchmark. A framework leveraging generative foundational models to generate realistic and interactive training scenarios and diverse expert demonstrations for benchmarking dual-arm robotic manipulation.
  • Figure 2: Real-to-simulation transfer and expert data generation. We first leverage a 3D generative foundation model to create diverse 3D assets from 2D images, complete with geometry, normals, and textures. This process is augmented by vision-language models to generate variations of object descriptions, enabling the creation of visually diverse yet functionally consistent 3D models. We then implement a spatial annotation framework that marks key functional and contact points, along with functional, approach, and lateral axes on these 3D assets. Finally, we employ LLMs to generate expert demonstrations by decomposing tasks into subtasks, inferring spatial constraints, and generating executable robot behavior code that is collision-free and satisfies kinematic requirements.
  • Figure 3: Examples of spatial annotations. Function and contact points with principal axes for functional parts and approach directions are extracted semi-automatically within RoboTwin for spatial- and geometry-aware manipulation and code generation.
  • Figure 4: Illustration of our robot platform, with the capabilities for teleoperation and data acquisition.
  • Figure 5: Success rate of the generated code for RoboTwin benchmark.
  • ...and 4 more figures
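
To make the spatial annotations named in Figures 2 and 3 concrete (function and contact points, plus functional, approach, and lateral axes), the sketch below shows one way such an annotation might be stored and used to derive a pre-grasp position. It is a minimal illustration under stated assumptions, not RoboTwin's actual data format or API: the class `SpatialAnnotation`, the helper `pre_grasp_position`, the `standoff` parameter, and the convention that the approach axis points in the direction the gripper travels toward the object are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Return v scaled to unit length."""
    norm = sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


@dataclass(frozen=True)
class SpatialAnnotation:
    """Hypothetical container for the annotation types listed in the captions."""
    function_point: Vec3  # where the object acts on the world (e.g. a hammer head)
    contact_point: Vec3   # where the gripper should make contact
    function_axis: Vec3   # direction along which the functional part acts
    approach_axis: Vec3   # assumed: direction the gripper moves toward the object
    lateral_axis: Vec3    # sideways axis of the object


def pre_grasp_position(ann: SpatialAnnotation, standoff: float) -> Vec3:
    """Back off from the contact point, against the approach axis, by `standoff`."""
    a = normalize(ann.approach_axis)
    p = ann.contact_point
    return (p[0] - standoff * a[0],
            p[1] - standoff * a[1],
            p[2] - standoff * a[2])


# Illustrative values only: a hammer lying on a table, approached from above.
hammer = SpatialAnnotation(
    function_point=(0.20, 0.0, 0.05),
    contact_point=(0.0, 0.0, 0.05),
    function_axis=(0.0, 0.0, -1.0),
    approach_axis=(0.0, 0.0, -2.0),  # not unit length; normalized before use
    lateral_axis=(0.0, 1.0, 0.0),
)
print(pre_grasp_position(hammer, standoff=0.10))
```

With these values the pre-grasp position sits about 0.10 units directly above the contact point. A position like this is the kind of spatial constraint that generated motion code could target before closing the gripper; the real pipeline's constraint format may differ.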