Navigating to Objects in the Real World
Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, Devendra Singh Chaplot
TL;DR
The paper investigates semantic navigation for mobile robots by comparing classical, modular learning, and end-to-end learning approaches in real homes and in simulation. It finds that modular learning transfers robustly, reaching a real-world success rate (SR) of $90\%$, while end-to-end policies drop sharply from $77\%$ in simulation to $23\%$ in the real world because of a large image-domain gap between simulation and reality; classical methods reach around $80\%$. A controlled simulated replica reveals that simulation and reality have distinct error modes, and highlights the mismatch between simulated reconstructions and real depth noise as a major bottleneck. The authors advocate modularity and semantic abstraction as a reliable path to Sim-to-Real transfer, and outline concrete steps to improve simulators and evaluation benchmarks so that they better reflect real-world conditions and error modes.
Abstract
Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches reactively map sensor inputs to actions with deep neural networks, while modular learning approaches enrich the classical pipeline with learning-based semantic sensing and exploration. But learned visual navigation policies have predominantly been evaluated in simulation. How well do different classes of methods work on a robot? We present a large-scale empirical study of semantic visual navigation methods, comparing representative methods from classical, modular, and end-to-end learning approaches across six homes with no prior experience, maps, or instrumentation. We find that modular learning works well in the real world, attaining a 90% success rate. In contrast, end-to-end learning does not, dropping from a 77% success rate in simulation to 23% in the real world due to a large image domain gap between simulation and reality. For practitioners, we show that modular learning is a reliable approach to navigate to objects: modularity and abstraction in policy design enable Sim-to-Real transfer. For researchers, we identify two key issues that prevent today's simulators from being reliable evaluation benchmarks: (A) a large Sim-to-Real gap in images and (B) a disconnect between simulation and real-world error modes. We also propose concrete steps forward.
