What does really matter in image goal navigation?
Gianluca Monaci, Philippe Weinzaepfel, Christian Wolf
TL;DR
This work interrogates whether image-goal navigation can be learned end-to-end from navigation signals alone. Through a large-scale, controlled study of architectural choices (early vs late fusion, channel stacking, SpaceToDepth, cross-attention) and simulator settings, it shows that realistic training without pre-training falls short, while pre-trained binocular encoders and early fusion substantially boost performance. The authors demonstrate a clear correlation between navigation success and emergent relative pose estimation, and reveal that the commonly used Sliding setting in Habitat heavily influences both perception learning and transfer to more realistic scenarios. These findings suggest that pre-training and architectural design enabling local correspondences are essential for robust, transferable image-goal navigation, with implications for related relative pose estimation tasks.
Abstract
Image goal navigation requires two different skills: first, core navigation skills, including detecting free space and obstacles and taking decisions based on an internal representation; and second, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods rely either on dedicated image matching or on pre-training computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved by training full agents end-to-end with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and would allow relative pose estimators to be trained from navigation reward alone. In this large experimental study, we investigate the effect of architectural choices such as late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced to a certain extent by simulator settings, which lead to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic settings to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.
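To make the fusion variants named above more concrete, the following is a minimal shape-level sketch in NumPy. It is not the authors' implementation: the function names, the `(H, W, C)` image layout, the block size, and the toy encoder are all assumptions chosen for illustration. Channel stacking joins observation and goal before any encoder sees them (early fusion), a space-to-depth projection folds spatial blocks into channels, and late fusion encodes each image separately and only concatenates the resulting embeddings.

```python
import numpy as np


def channel_stack(obs, goal):
    """Early fusion (assumed form): concatenate two (H, W, C) images on the channel axis."""
    return np.concatenate([obs, goal], axis=-1)


def space_to_depth(img, block):
    """Fold each block x block spatial patch of an (H, W, C) image into channels.

    Output shape is (H // block, W // block, block * block * C).
    """
    h, w, c = img.shape
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    x = img.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/b, W/b, b, b, C)
    return x.reshape(h // block, w // block, block * block * c)


def late_fusion(obs, goal, encoder):
    """Late fusion (assumed form): encode each image separately, then concatenate embeddings."""
    return np.concatenate([encoder(obs), encoder(goal)], axis=-1)


if __name__ == "__main__":
    obs = np.arange(48, dtype=np.float32).reshape(4, 4, 3)
    goal = obs + 100.0

    stacked = channel_stack(obs, goal)
    print(stacked.shape)

    folded = space_to_depth(stacked, block=2)
    print(folded.shape)

    # Hypothetical stand-in for a learned encoder: global average pooling.
    toy_encoder = lambda image: image.mean(axis=(0, 1))
    print(late_fusion(obs, goal, toy_encoder).shape)
```

In the early-fusion path, pixels from both images sit side by side in the same tensor, so a downstream network can compare them at the same spatial location. In the late-fusion path, with a pooling encoder like the toy one here, spatial detail is discarded before the two images are compared.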
