MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

Yejin Kim, Wilbert Pumacay, Omar Rayyan, Max Argus, Winson Han, Eli VanderBilt, Jordi Salvador, Abhay Deshpande, Rose Hendrix, Snehal Jauhri, Shuo Liu, Nur Muhammad Mahi Shafiullah, Maya Guru, Ainaz Eftekhar, Karen Farley, Donovan Clay, Jiafei Duan, Arjun Guru, Piper Wolters, Alvaro Herrasti, Ying-Chun Lee, Georgia Chalvatzaki, Yuchen Cui, Ali Farhadi, Dieter Fox, Ranjay Krishna

TL;DR

MolmoSpaces addresses the challenge of evaluating robot policies under real-world long-tail variability by providing a large-scale, simulator-agnostic open ecosystem of 230k scenes, 130k objects, and 42M grasps, along with a zero-shot 8-task benchmark (MolmoSpaces-Bench). It unifies scenes, objects, grasps, robots, and tooling across MuJoCo, IsaacSim, and ManiSkill, enabling scalable data generation and cross-simulator evaluation for navigation and manipulation. The study demonstrates strong sim-to-real correlation and analyzes distributional robustness, uncovering prompt sensitivity and occlusion vulnerabilities that guide further policy improvements. Overall, MolmoSpaces offers a practical foundation for scalable robotics learning research, with open assets and tooling to accelerate development of generalist policies.

Abstract

Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.
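
The sim-to-real correlation quoted above is reported as two coefficients, R and ρ. Assuming these denote the Pearson and Spearman rank correlations between per-policy success rates in simulation and in the real world (the abstract does not spell this out), a minimal sketch of how such a pair could be computed is given below. The success rates are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder success rates for five hypothetical policies (NOT paper data).
sim_success = [0.10, 0.25, 0.40, 0.55, 0.70]
real_success = [0.05, 0.30, 0.35, 0.60, 0.65]

print("Pearson R   :", round(pearson(sim_success, real_success), 3))
print("Spearman rho:", round(spearman(sim_success, real_success), 3))
```

Under this reading, a high ρ means the benchmark orders policies the same way real-world evaluation does, while a high R additionally indicates that the success rates are close to linearly related.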

Paper Structure (28 sections, 24 figures, 7 tables)

Figures (24)

  • Figure 1: MolmoSpaces is an open ecosystem consisting of a large number of simulation environments, 3D articulated objects, and tasks for training and evaluating robot navigation and manipulation at scale. It provides object metadata, grasps, and tooling to generate training data, create benchmarks, and evaluate policies in a manner that correlates with real-world performance.
  • Figure 2: Examples of diverse scene environments from MolmoSpaces-Scenes-MultiType, rendered with the Filament rendering engine. Our ecosystem contains art studios, cat cafes, lounges, museums, and many other scene types, all pre-populated with objects to be manipulated.
  • Figure 3: An example scene rendered across different simulators: MuJoCo, Isaac Sim, and ManiSkill. When using MuJoCo, the scenes can be rendered with either the OpenGL renderer (Classic) or Filament (Filament).
  • Figure 4: A random sampling of object types in our ecosystem, with different sizes, shapes, and articulations. These examples are rendered with Filament.
  • Figure 5: Our grasp generation pipeline consists of separate streams for rigid and articulated assets. We generate 42M+ verified grasps that can be used to create scripted interaction policies. Grasps can be used in different simulation environments, with an Isaac example shown on the right.
  • ...and 19 more figures