Navigating to Objects in the Real World

Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, Devendra Singh Chaplot

TL;DR

The paper investigates semantic navigation for mobile robots by comparing classical, modular learning, and end-to-end learning approaches in real homes and in simulation. Modular learning transfers robustly, reaching a real-world success rate (SR) of $90\%$, while end-to-end learning drops from $77\%$ in simulation to $23\%$ in the real world because of a large image-domain gap between simulation and reality; the classical approach reaches about $80\%$. A controlled simulation replica shows that simulation and the real world fail in different ways, and points to the mismatch between simulated reconstructions and real depth noise as a major bottleneck. The authors advocate modularity and semantic abstraction as a reliable path to Sim-to-Real transfer, and outline concrete steps to make simulators and evaluation benchmarks better reflect real-world conditions and error modes.

Abstract

Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from 77% simulation to 23% real-world success rate due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks - (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes - and propose concrete steps forward.

Paper Structure

This paper contains 15 sections, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Deployment of the semantic navigation policies in six visually diverse homes.
  • Figure 2: Three approaches to navigate to objects. (A) The modular learning approach builds a top-down semantic map, selects a semantic exploration goal in this space, and plans low-level actions to reach this goal. (B) The classical approach also builds a semantic map but selects the closest unexplored region as the exploration goal, independently of the goal object. (C) The end-to-end learning approach directly maps sensor inputs and the goal object to low-level actions with a deep neural network.
  • Figure 3: Navigation performance in simulation vs. real at scale. We compare the Success Rate (SR) and Success weighted by Path Length (SPL) of methods representative of classical, end-to-end learning, and modular learning approaches on large real-world ($60$ episodes in $6$ homes) and simulation datasets (single-floor navigation episodes of the val split of the 2022 Habitat Challenge, with $1093$ episodes in $20$ simulated homes of the HM3D Semantics dataset [yadav2022habitat]). (A) Performance for all methods is comparable in simulation, at around $80\%$ success rate. (B) Classical and modular learning approaches transfer well, up from $78\%$ to $80\%$ and $81\%$ to $90\%$, respectively. (C) End-to-end learning fails to transfer, down from $77\%$ to $23\%$ success rate.
  • Figure 4: Three approaches on the same episode. (A) Modular learning reaches the couch goal in $84$ steps (SPL $= 0.74$). (B) End-to-end learning collides too many times ($20$ max) after $121$ steps. (C) The classical policy reaches the goal after $181$ steps and a detour through the kitchen (SPL $= 0.33$).
  • Figure 5: Sim-vs-Real domain invariances, gaps, and their effects on segmentation. From left to right, all images come from episodes in our controlled study: (A) The semantic map space is invariant between the real world and simulation. (B) The image space exhibits a large gap between the real world and simulation. (C) This gap causes a large drop in performance when transferring a segmentation model trained in the real world to simulation and vice versa.
  • ...and 7 more figures
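
The captions for Figures 3 and 4 report Success Rate (SR) and Success weighted by Path Length (SPL) without spelling out how they are computed. The sketch below uses the commonly used SPL definition, $\frac{1}{N}\sum_i S_i \, \ell_i / \max(p_i, \ell_i)$, where $S_i$ is the binary success of episode $i$, $\ell_i$ the shortest-path length to the goal, and $p_i$ the length of the path the agent actually took. It is an illustration only, not the paper's evaluation code: the `Episode` class, its field names, and the episode values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """One navigation episode (hypothetical container, not from the paper)."""
    success: bool                # did the agent reach the goal object?
    shortest_path_length: float  # shortest start-to-goal path length, assumed > 0
    agent_path_length: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.success:
            # Ratio is 1.0 for an optimal path and shrinks as the agent detours.
            total += e.shortest_path_length / max(
                e.agent_path_length, e.shortest_path_length
            )
        # Failed episodes contribute 0 regardless of path length.
    return total / len(episodes)


# Made-up episodes for illustration; these are not results from the paper.
episodes = [
    Episode(success=True, shortest_path_length=5.0, agent_path_length=10.0),
    Episode(success=True, shortest_path_length=4.0, agent_path_length=4.0),
    Episode(success=False, shortest_path_length=6.0, agent_path_length=12.0),
    Episode(success=True, shortest_path_length=3.0, agent_path_length=4.0),
]
print(f"SR  = {success_rate(episodes):.4f}")
print(f"SPL = {spl(episodes):.4f}")
```

Because every failed episode contributes zero and every successful one contributes at most one, SPL can never exceed SR. A method that succeeds often but takes long detours will therefore show a high SR with a much lower SPL.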