MobileManiBench: Simplifying Model Verification for Mobile Manipulation
Wenbo Wang, Fangyun Wei, QiXiu Li, Xi Chen, Yaobo Liang, Chang Xu, Jiaolong Yang, Baining Guo
TL;DR
This work addresses the scalability gap in vision-language-action (VLA) research for mobile manipulation by introducing MobileManiBench, a high-fidelity, simulation-based benchmark built on NVIDIA Isaac Sim and PPO-based RL. It defines two mobile platforms (G1 with a parallel gripper and XHand with a dexterous hand), 630 objects across 20 categories, five manipulation skills, and 100+ tasks, yielding around 300,000 annotated trajectories across 100 realistic scenes. The methodology comprises three stages: universal MobileManiRL policy training for robot–object–skill triplets, MobileManiDataset generation with rich multi-modal data, and MobileManiVLA training to build a universal VLA model that generalizes to unseen objects and scenes using a diffusion-transformer-based action module and multi-view RGB-D inputs. Experiments demonstrate the platform’s effectiveness in benchmarking VLA models, revealing the benefits of multi-view and state-aware inputs, the value of mobile-base mobility for non-tabletop tasks, and the challenges of unseen-object generalization, thereby accelerating data-efficient, generalizable embodied AI research. MobileManiBench thus offers a reproducible, scalable testbed that supports rapid iteration and fair comparisons across VLA architectures and robot embodiments, with practical implications for deploying robust mobile manipulation systems.
Abstract
Vision-language-action (VLA) models have advanced robotic manipulation but remain constrained by their reliance on large, teleoperation-collected datasets dominated by static tabletop scenes. We propose a simulation-first framework to verify VLA architectures before real-world deployment and introduce MobileManiBench, a large-scale benchmark for mobile robotic manipulation. Built on NVIDIA Isaac Sim and powered by reinforcement learning, our pipeline autonomously generates diverse manipulation trajectories with rich annotations (language instructions, multi-view RGB-depth-segmentation images, and synchronized object/robot states and actions). MobileManiBench features 2 mobile platforms (parallel-gripper and dexterous-hand robots), 2 synchronized cameras (head and right wrist), 630 objects in 20 categories, and 5 skills (open, close, pull, push, pick) spanning over 100 tasks performed in 100 realistic scenes, yielding 300K trajectories. This design enables controlled, scalable studies of robot embodiments, sensing modalities, and policy architectures, accelerating research on data efficiency and generalization. We benchmark representative VLA models and report insights into perception, reasoning, and control in complex simulated environments.
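To make the annotation list above concrete, the following is a minimal sketch of how one annotated trajectory might be laid out in memory. It is not the released dataset format: the class names (`Frame`, `Trajectory`), the field names, the camera keys, and the helper `missing_views` are all hypothetical, chosen only to mirror the annotations, cameras, and skills named in the abstract.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Values taken from the abstract; the identifiers themselves are assumptions.
SKILLS = ("open", "close", "pull", "push", "pick")
CAMERAS = ("head", "right_wrist")


@dataclass
class Frame:
    """One synchronized timestep (hypothetical layout)."""
    rgb: Dict[str, Any]            # camera name -> RGB image
    depth: Dict[str, Any]          # camera name -> depth image
    segmentation: Dict[str, Any]   # camera name -> segmentation mask
    robot_state: List[float]
    object_state: List[float]
    action: List[float]


@dataclass
class Trajectory:
    """One annotated trajectory (hypothetical layout)."""
    instruction: str      # language instruction
    platform: str         # e.g. "parallel_gripper" or "dexterous_hand"
    skill: str            # one of SKILLS
    object_category: str
    scene_id: int
    frames: List[Frame] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject skills outside the five listed in the abstract.
        if self.skill not in SKILLS:
            raise ValueError(f"unknown skill: {self.skill!r}")


def missing_views(traj: Trajectory) -> List[int]:
    """Return indices of frames that lack any camera view in any modality."""
    bad = []
    for i, f in enumerate(traj.frames):
        for modality in (f.rgb, f.depth, f.segmentation):
            if any(cam not in modality for cam in CAMERAS):
                bad.append(i)
                break
    return bad
```

A check such as `missing_views` is one way to confirm that both camera streams are present at every timestep before a trajectory is used to train or evaluate a multi-view model.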
