HomeRobot: Open-Vocabulary Mobile Manipulation

Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav, Austin Wang, Mukul Khanna, Theophile Gervet, Tsung-Yen Yang, Vidhi Jain, Alexander William Clegg, John Turner, Zsolt Kira, Manolis Savva, Angel Chang, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi, Yonatan Bisk, Chris Paxton

TL;DR

The paper defines Open-Vocabulary Mobile Manipulation (OVMM) and introduces HomeRobot OVMM, a reproducible benchmark with a simulation component (Habitat-based) and a real-world component (Hello Robot Stretch). It provides a unified library (HomeRobot) and baselines (heuristic and RL) to study end-to-end perception, navigation, and manipulation in multi-room homes, and it shows evidence of sim-to-real transfer alongside notable real-world challenges. The RL baselines can outperform the heuristic baselines on some tasks but degrade under noisy open-vocabulary perception, whereas the heuristic methods are more robust to perception errors. Overall performance remains limited, leaving substantial room to improve perception, planning, and integration. The work emphasizes reproducibility and sets the stage for future end-to-end baselines, richer language, and more complex real-world evaluation.

Abstract

HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integrating the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, which provides a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways future research can improve performance. See videos on our website: https://ovmm.github.io/.

Paper Structure

This paper contains 50 sections, 3 equations, 22 figures, and 8 tables.

Figures (22)

  • Figure 1: Open-Vocabulary Mobile Manipulation requires agents to search for a previously unseen object at a particular location, and move it to the correct receptacle.
  • Figure 2: A low-cost home robot performing tasks in both a simulated and a real-world environment. We provide both (1) challenging simulated tasks, wherein a mobile manipulator robot must find and grasp multiple seen and unseen objects, and (2) a corresponding real-world robotics stack to allow others to reproduce this research and evaluation to produce useful home robot assistants.
  • Figure 3: HSSD scenes.
  • Figure 4: HomeRobot is a simple, easy-to-set-up library that works in multiple environments and requires only relatively affordable hardware. Computationally intensive operations are performed on a desktop PC with a GPU, and a dedicated consumer-grade router provides a network interface to a robot running low-level control and SLAM.
  • Figure 5: A few success and failure cases for our simple grasping policy under the new grasp success condition that requires the agent's arm to reach near the object without colliding. The agent resorts to sideways grasps when the object can't be reached via a top-down grasp that bends the gripper. Most grasping failures are caused by collisions with the scene.
  • ...and 17 more figures