ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI

Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, Hao Su

TL;DR

ManiSkill3 addresses the data- and compute-hungry challenge of generalizable robotic manipulation by delivering a GPU-accelerated, highly scalable simulator with heterogeneous environments, an intuitive API, VR teleoperation, and robust sim2real/real2sim capabilities. It demonstrates unprecedented GPU-accelerated throughput (up to 30k+ FPS) and dramatically lower memory footprints, enabling large-scale visual RL and offline/online imitation learning with diverse task categories and robots. The framework provides comprehensive baselines (PPO, TD-MPC2, BC, diffusion policies, PerAct, VLA models) and demonstration pipelines, plus digital-twin oriented evaluation to bridge simulation and real-world performance. Overall, ManiSkill3 lowers barriers to scaling embodied AI research, supports rapid surrogates for real-world transfer, and invites community contributions through its open-source design and extensive documentation.

Abstract

Simulation has enabled unprecedented compute-scalable approaches to robot learning. However, many existing simulation frameworks support only a narrow range of scenes/tasks and lack features critical for scaling generalizable robotics and sim2real. We introduce and open-source ManiSkill3, the fastest state-visual GPU parallelized robotics simulator with contact-rich physics targeting generalizable manipulation. ManiSkill3 supports GPU parallelization of many aspects including simulation+rendering, heterogeneous simulation, pointcloud/voxel visual input, and more. Simulation with rendering on ManiSkill3 can run 10-1000x faster with 2-3x less GPU memory usage than other platforms, achieving up to 30,000+ FPS in benchmarked environments due to minimal Python/PyTorch overhead in the system, simulation on the GPU, and the use of the SAPIEN parallel rendering system. Tasks that used to take hours to train can now take minutes. We further provide the most comprehensive range of GPU parallelized environments/tasks spanning 12 distinct domains including but not limited to mobile manipulation, tasks such as drawing, humanoids, and dexterous manipulation in realistic scenes designed by artists or real-world digital twins. In addition, millions of demonstration frames are provided from motion planning, RL, and teleoperation. ManiSkill3 also provides a comprehensive set of baselines that span popular RL and learning-from-demonstrations algorithms.
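
To make the throughput figures above concrete, the following minimal sketch shows one common way to compute an aggregate FPS number for a parallelized simulator: total frames stepped across all parallel environments divided by wall-clock time. This convention is an assumption here, not a description of the paper's exact benchmark script. The helper names (`aggregate_fps`, `time_steps`) are hypothetical, and the numbers in the usage line are made up for illustration rather than measured.

```python
import time
from typing import Callable


def aggregate_fps(num_envs: int, num_steps: int, elapsed_seconds: float) -> float:
    """Frames per second summed over all parallel environments.

    Assumes every call to the step function advances all `num_envs`
    environments by one frame (hypothetical convention).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_envs * num_steps / elapsed_seconds


def time_steps(step_fn: Callable[[], None], num_steps: int) -> float:
    """Wall-clock seconds taken to call `step_fn` `num_steps` times."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return time.perf_counter() - start


# Illustrative, made-up numbers: 1024 environments, 300 batched steps in 10 s.
fps = aggregate_fps(num_envs=1024, num_steps=300, elapsed_seconds=10.0)
print(f"{fps:.0f} FPS")  # 30720 FPS
```

In practice `step_fn` would wrap a batched environment step. With GPU simulation the device should be synchronized before reading the clock, otherwise the measured time may cover only kernel launches.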

Paper Structure (56 sections, 2 equations, 41 figures, 5 tables)

This paper contains 56 sections, 2 equations, 41 figures, 5 tables.

Figures (41)

  • Figure 1: Multiple distinct task categories are displayed, ranging from room-scale tasks to humanoid interactions and drawing tasks. The majority of the tasks shown are GPU-parallelized, simulating and rendering at state-of-the-art speeds and GPU memory efficiency. Scenes are from ReplicaCAD and AI2-THOR.
  • Figure 2: GPU simulation+rendering speeds (RGB) on the Cartpole environment with different camera setups, for ManiSkill3 and Isaac Lab. Annotated numbers indicate GPU memory usage; Isaac Lab has no data points beyond 128 environments because it runs out of GPU memory. Note that this rendering setting mimics that of real-world datasets collected in Open-X and Droid. Speed depends on a few factors, primarily the number of objects, the geometric complexity of each object, and the simulation/rendering configurations, which can be tuned for speed or accuracy. As a result, the numbers and trends shown here may not hold for every environment.
  • Figure 3: Comparison of ManiSkill3 (Top row) and Isaac Lab (Bottom row) parallel rendering 640x480 RGB and depth image outputs of the Cartpole benchmark task.
  • Figure 4: GPU simulation+rendering speeds of various tasks with a single 128x128 resolution camera, a simulation frequency of 120, and a control frequency of 60, meaning the camera renders every 2 sim steps. RGB, depth, and segmentation data are all rendered simultaneously. The only major differences between the environments behind the three curves are the objects and robots being simulated.
  • Figure 5: Parallel rendering outputs of 1024 parallel environments for the StackCube and PushT tasks, with a subset of 4 of them visualized here. Original renders are 128x128; the images shown are up-scaled for clarity. The top row shows camera pose randomization and the bottom row shows texture randomization. Rendered depth/segmentation data is not shown here.
  • ...and 36 more figures
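
The Figure 4 caption relates the simulation frequency (120) to the control frequency (60) and concludes that the camera renders every 2 sim steps. A minimal sketch of that arithmetic follows. The function names are hypothetical, and it assumes the camera renders once per control step and that the simulation frequency is an integer multiple of the control frequency.

```python
from typing import List


def sim_steps_per_control_step(sim_freq: int, control_freq: int) -> int:
    """Number of physics steps taken for each control (and render) step."""
    if control_freq <= 0 or sim_freq % control_freq != 0:
        # Assumption of this sketch: frequencies divide evenly.
        raise ValueError("sim_freq must be a positive multiple of control_freq")
    return sim_freq // control_freq


def render_schedule(sim_freq: int, control_freq: int, num_sim_steps: int) -> List[int]:
    """1-based sim-step indices after which the camera renders."""
    k = sim_steps_per_control_step(sim_freq, control_freq)
    return [i for i in range(1, num_sim_steps + 1) if i % k == 0]


print(sim_steps_per_control_step(120, 60))  # 2
print(render_schedule(120, 60, 8))          # [2, 4, 6, 8]
```

Under these assumptions, raising the simulation frequency while keeping the control frequency fixed adds physics steps per rendered frame, which is one way such configurations trade speed for accuracy.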