RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, Ali Farhadi

TL;DR

RoboTHOR introduces an open, modular platform pairing simulated embodied agents with physical robots to study and benchmark simulation-to-real transfer in indoor visual navigation. The authors define a semantic navigation task, implement multiple baselines, and evaluate both sim-to-sim and sim-to-real transfer, revealing a significant performance drop when moving from simulation to reality due to gaps in appearance and control dynamics. Key analyses show perceptual and sensor-domain misalignments, camera-parameter sensitivity, and the ineffectiveness of naive image-translation domain adaptation. The work emphasizes RoboTHOR's potential to democratize embodied AI research and to make it more reproducible and faster-moving by offering remote, scalable benchmarking across simulated and real environments.

Abstract

Visual recognition ecosystems (e.g. ImageNet, Pascal, COCO) have undeniably played a prevailing role in the evolution of modern computer vision. We argue that interactive and embodied visual AI has reached a stage of development similar to visual recognition prior to the advent of these ecosystems. Recently, various synthetic environments have been introduced to facilitate research in embodied AI. Notwithstanding this progress, the crucial question of how well models trained in simulation generalize to reality has remained largely unanswered. The creation of a comparable ecosystem for simulation-to-real embodied AI presents many challenges: (1) the inherently interactive nature of the problem, (2) the need for tight alignments between real and simulated worlds, (3) the difficulty of replicating physical conditions for repeatable experiments, and (4) the associated cost. In this paper, we introduce RoboTHOR to democratize research in interactive and embodied visual AI. RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world. As a first benchmark, our experiments show there exists a significant gap between the performance of models trained in simulation when they are tested in both simulations and their carefully constructed physical analogs. We hope that RoboTHOR will spur the next stage of evolution in embodied computer vision. RoboTHOR can be accessed at the following link: https://ai2thor.allenai.org/robothor

Paper Structure

This paper contains 9 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We present RoboTHOR, a platform to develop and test embodied AI agents with corresponding environments in simulation and the physical world. The complexity of environments in RoboTHOR along with disparities in appearance and control dynamics between simulation and reality pose new challenges and open many avenues for further research.
  • Figure 2: Distribution of object categories in RoboTHOR
  • Figure 3: Spatial distribution of objects and walls. Heatmaps illustrate the diverse spatial distribution of target objects, background objects, furniture, and walls.
  • Figure 4: Object visibility statistics. The distribution of objects visible to an agent at a single time instant.
  • Figure 5: Histogram of actions along the shortest path. The number of actions invoked along the shortest paths to targets in the training scenes. Note that the shortest path is very difficult to obtain in practice, since it assumes a priori knowledge of the scene.
  • ...and 4 more figures