MobileManiBench: Simplifying Model Verification for Mobile Manipulation

Wenbo Wang, Fangyun Wei, QiXiu Li, Xi Chen, Yaobo Liang, Chang Xu, Jiaolong Yang, Baining Guo

TL;DR

This work addresses the scalability gap in vision-language-action (VLA) research for mobile manipulation by introducing MobileManiBench, a high-fidelity, simulation-based benchmark built on NVIDIA Isaac Sim and PPO-based RL. It defines two mobile platforms (G1 with a parallel gripper and XHand with a dexterous hand), 630 objects across 20 categories, five manipulation skills, and 100+ tasks, yielding around 300,000 annotated trajectories across 100 realistic scenes. The methodology comprises three stages: training a universal MobileManiRL policy for each robot–object–skill triplet, generating MobileManiDataset with rich multi-modal data, and training MobileManiVLA, a universal VLA model that uses a diffusion-transformer-based action module and multi-view RGB-D inputs to generalize to unseen objects and scenes. Experiments demonstrate the platform’s effectiveness in benchmarking VLA models, revealing the benefits of multi-view and state-aware inputs, the value of a mobile base for non-tabletop tasks, and the challenges of unseen-object generalization, thereby accelerating data-efficient, generalizable embodied AI research. MobileManiBench thus offers a reproducible, scalable testbed that supports rapid iteration and fair comparisons across VLA architectures and robot embodiments, with practical implications for deploying robust mobile manipulation systems.

Abstract

Vision-language-action (VLA) models have advanced robotic manipulation but remain constrained by their reliance on large, teleoperation-collected datasets dominated by static tabletop scenes. We propose a simulation-first framework to verify VLA architectures before real-world deployment and introduce MobileManiBench, a large-scale benchmark for mobile-based robotic manipulation. Built on NVIDIA Isaac Sim and powered by reinforcement learning, our pipeline autonomously generates diverse manipulation trajectories with rich annotations (language instructions, multi-view RGB-depth-segmentation images, and synchronized object/robot states and actions). MobileManiBench features 2 mobile platforms (parallel-gripper and dexterous-hand robots), 2 synchronized cameras (head and right wrist), 630 objects in 20 categories, and 5 skills (open, close, pull, push, pick) with over 100 tasks performed in 100 realistic scenes, yielding 300K trajectories. This design enables controlled, scalable studies of robot embodiments, sensing modalities, and policy architectures, accelerating research on data efficiency and generalization. We benchmark representative VLA models and report insights into perception, reasoning, and control in complex simulated environments.
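
To make the annotation list above concrete, the following is a minimal sketch of how one annotated trajectory could be represented in memory. It is not the authors' data format: the class names `Step` and `Trajectory`, the field names, the camera key `right_wrist`, and the `validate` helper are all hypothetical, chosen only to mirror the annotations named in the abstract (language instruction, two camera views with RGB, depth, and segmentation, and synchronized robot/object states and actions).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

# Values taken from the abstract; the identifiers themselves are assumptions.
CAMERAS = ("head", "right_wrist")
MODALITIES = ("rgb", "depth", "segmentation")
SKILLS = ("open", "close", "pull", "push", "pick")


@dataclass
class Step:
    """One synchronized timestep (hypothetical layout)."""
    images: Dict[str, Dict[str, object]]  # camera -> modality -> image array
    robot_state: Sequence[float]
    object_state: Sequence[float]
    action: Sequence[float]


@dataclass
class Trajectory:
    """One annotated trajectory (hypothetical layout)."""
    instruction: str      # language instruction
    robot: str            # e.g. "G1" or "XHand"
    skill: str            # one of SKILLS
    object_category: str  # one of the 20 categories
    scene_id: int         # index of the scene
    steps: List[Step] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the skill is unknown or a view is incomplete."""
        if self.skill not in SKILLS:
            raise ValueError(f"unknown skill: {self.skill!r}")
        for t, step in enumerate(self.steps):
            for cam in CAMERAS:
                present = step.images.get(cam, {})
                missing = [m for m in MODALITIES if m not in present]
                if missing:
                    raise ValueError(f"step {t}: camera {cam!r} lacks {missing}")
```

A record built this way can be checked with `Trajectory(...).validate()` before it is used for training; the actual storage layout of MobileManiDataset may differ.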

Paper Structure (24 sections, 10 equations, 19 figures, 13 tables)

Figures (19)

  • Figure 1: Overview of MobileManiBench. It features 2 mobile-based robots: the G1 robot with a parallel gripper and the XHand robot with a dexterous hand. The benchmark includes 630 articulated and holistic objects across 20 categories and supports 5 mobile manipulation skills—open, close, pull, push, and pick—enabling over 100 tasks. To efficiently scale data generation while ensuring task success, we train a universal MobileManiRL policy for each robot–object–skill triplet and generate MobileManiDataset across 100 realistic scenes with 300K trajectories and 3 data modalities—language instructions, multi-view RGB–depth–segmentation images, synchronized object/robot states and actions. MobileManiBench offers a flexible testbed to accelerate model innovation and data-efficiency research for VLA models.
  • Figure 2: Definitions of the robot gripper/hand points (blue), object grasp point (red), and goal point (green) across diverse tasks.
  • Figure 3: Illustrations of simplified scenes for MobileManiRL training and realistic scenes for MobileManiDataset generation and MobileManiVLA evaluation.
  • Figure 4: Success rates of MobileManiRL and MobileManiVLA on the G1 and XHand robots across 20 object categories and 5 mobile manipulation skills. Manipulation motion patterns vary by category: box, laptop, and oven require flipping a lid upward or downward; microwave, car, fridge, and round door require grasping a door handle and then pivoting left or right; faucet requires rotating a handle; table and cart require grasping a handle and then pulling or pushing; holistic (YCB) objects require grasping and lifting.
  • Figure 5: Initialization of the robot, object, ground, and table.
  • ...and 14 more figures