GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs

Pu Hua, Minghuan Liu, Annabella Macaluso, Yunfeng Lin, Weinan Zhang, Huazhe Xu, Lirui Wang

TL;DR

GenSim2 addresses scalable robotic data generation and sim-to-real transfer for complex articulated tasks by combining coding LLMs that have multi-modal and reasoning capabilities with planning and RL solvers to generate diverse tasks and demonstrations. It introduces a proprioceptive point-cloud transformer (PPT) policy that learns from the generated demonstrations and transfers zero-shot to the real world. The pipeline can generate data for up to 100 articulated tasks with 200 objects, and co-training on generated and real-world data improves policy performance by 20% compared with training exclusively on limited real data. The result is a more scalable robotic data pipeline that reduces human effort and offers a practical path toward more generalizable sim-to-real systems.

Abstract

Robotic simulation today remains challenging to scale up due to the human effort required to create diverse simulation tasks and scenes. Simulation-trained policies also face scalability issues, as many sim-to-real methods focus on a single task. To address these challenges, this work proposes GenSim2, a scalable framework that leverages coding LLMs with multi-modal and reasoning capabilities for complex and realistic simulation task creation, including long-horizon tasks with articulated objects. To automatically generate demonstration data for these tasks at scale, we propose planning and RL solvers that generalize within object categories. The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduces the required human effort. To utilize such data, we propose an effective multi-task language-conditioned policy architecture, dubbed proprioceptive point-cloud transformer (PPT), that learns from the generated demonstrations and exhibits strong sim-to-real zero-shot transfer. Combining the proposed pipeline and the policy architecture, we show a promising use of GenSim2: the generated data can be used for zero-shot transfer or for co-training with real-world collected data, which improves policy performance by 20% compared with training exclusively on limited real data.

Paper Structure (32 sections, 16 figures, 8 tables)

Figures (16)

  • Figure 1: GenSim2 introduces a scalable task and data generation framework in SAPIEN (Xiang et al., 2020) for articulated objects, built on multi-modal and reasoning LLMs. The framework comprises three main stages: 1) (Top) we first generate large-scale robotics tasks and collect large amounts of demonstration data with LLMs; 2) (Bottom-left) we then train a multi-task point-cloud-based policy in simulation with imitation learning; 3) (Bottom-right) finally, we transfer the policy zero-shot to the real world.
  • Figure 2: Overview of the GenSim2 framework. The pipeline consists of (1) task proposal, (2) solver creation, (3) multi-task training, and (4) generalization evaluation and sim-to-real transfer.
  • Figure 3: Multi-modal task generation pipeline. GenSim2 is first prompted to generate the task code, given few-shot examples and the available assets. It then renders the scene image, and the keypoint information of the task assets is fed into the MLLM to generate a planner config for the actuation pose. The actuation pose is then extended to actuation motions, which are fed into GPT-4V for inspection. In this example, the pipeline produces a task motion for opening the box.
  • Figure 4: The proposed proprioceptive point-cloud transformer (PPT) policy architecture maps language, point-cloud, and proprioception inputs into a shared latent space for action prediction. The policy action head supports various architectures, from diffusion (Chi et al., 2023) to a transformer decoder (Zhao et al., 2023).
  • Figure 5: Ablation study on components of our generation pipeline. All results are based on at least 10 generations. (Left) We use various types of LLMs for solver generation and find that the multi-modal LLM (GPT-4V) outperforms the others. (Middle) The performance of our method increases as we raise the maximum number of iterations for rejection sampling. (Right) Splitting solver generation into a prompt chain outperforms generating the whole solver config at once.
  • ...and 11 more figures
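
The rejection-sampling loop mentioned in the Figure 5 caption can be illustrated with a minimal sketch. This is not the GenSim2 implementation: the function names (`rejection_sample_solver`, `propose`, `validate`), the dictionary-shaped config, and the toy 30% per-attempt success probability are all assumptions made for illustration. In a real pipeline, `propose` would presumably query an LLM for a solver config and `validate` would execute it in simulation.

```python
import random
from typing import Callable, Optional


def rejection_sample_solver(
    propose: Callable[[int], dict],
    validate: Callable[[dict], bool],
    max_iterations: int,
) -> Optional[dict]:
    """Return the first proposed config that passes validation, else None.

    `propose` and `validate` are hypothetical stand-ins for an LLM call
    and a simulator check; neither name comes from the paper.
    """
    for attempt in range(max_iterations):
        candidate = propose(attempt)
        if validate(candidate):
            return candidate  # accept the first valid sample
    return None  # every sample within the budget was rejected


def make_toy_proposer(rng: random.Random, per_try_success: float = 0.3):
    """Build a fake proposer whose configs are valid with a fixed probability."""
    def propose(attempt: int) -> dict:
        return {"attempt": attempt, "valid": rng.random() < per_try_success}
    return propose


def toy_validate(config: dict) -> bool:
    """Fake simulator check: read the precomputed validity flag."""
    return config["valid"]


def success_rate(max_iterations: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of toy trials that yield a valid config within the budget."""
    rng = random.Random(seed)
    propose = make_toy_proposer(rng)
    wins = sum(
        rejection_sample_solver(propose, toy_validate, max_iterations) is not None
        for _ in range(trials)
    )
    return wins / trials


if __name__ == "__main__":
    for budget in (1, 3, 5):
        print(budget, success_rate(budget))
```

In this toy model each attempt succeeds independently, so a larger iteration budget can only raise the chance of finding a valid config, which matches the qualitative trend the caption reports. The numbers it prints say nothing about the paper's measured success rates.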