What does really matter in image goal navigation?

Gianluca Monaci, Philippe Weinzaepfel, Christian Wolf

TL;DR

This work interrogates whether image-goal navigation can be learned end-to-end from navigation signals alone. Through a large-scale, controlled study of architectural choices (early vs late fusion, channel stacking, SpaceToDepth, cross-attention) and simulator settings, it shows that realistic training without pre-training falls short, while pre-trained binocular encoders and early fusion substantially boost performance. The authors demonstrate a clear correlation between navigation success and emergent relative pose estimation, and reveal that the commonly used Sliding setting in Habitat heavily influences both perception learning and transfer to more realistic scenarios. These findings suggest that pre-training and architectural design enabling local correspondences are essential for robust, transferable image-goal navigation, with implications for related relative pose estimation tasks.

Abstract

Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and making decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods rely either on dedicated image-matching or on pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. In this large experimental study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced to a certain extent by simulator settings, leading to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic settings, to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Image goal navigation requires general navigation skills, but also in particular the extraction of directional information towards the goal. We analyze which architecture design choices influence these capabilities, and to what degree they, together with the underlying sub-task of relative pose estimation (which we probe with a dedicated head $p$), can be trained end-to-end from the navigation loss directly, without any pose ground-truth.
  • Figure 2: Different architecture choices for binocular encoders learning to compare the observed image $\mathbf{o}_t$ with the goal image $\mathbf{g}$: (a) Late Fusion encodes them separately and comparison is done "late" between embedding vectors $\phi_o(\mathbf{o}_t)$ and $\phi_g(\mathbf{g})$, making correspondence computations difficult. (b) ChannelCat stacks images over the channel dimension, followed by convolutional encoders $\phi\left([\mathbf{o}_t, \mathbf{g}]_{\textrm{dim}=1}\right)$. It makes correspondence computations possible in principle if the CNN receptive field is big enough. (c) SpaceToDepth reshapes the patch dimension into the channel dimension. Combined with ChannelCat [sun2024fgprompt], it could allow correspondence to emerge in each layer directly through conv filters. (d) Binocular ViTs [CrocoNav2024] model correspondence directly as cross-attention between patch tokens.
  • Figure 3: Nav vs. Rel-pose: navigation performance (SR, %) plotted against pose estimation probing accuracy (% of estimates with error ${<}2\,\textrm{m}$ and ${<}20^\circ$) for 4 types of visual encoders $\phi$: (i) trained with sliding; (ii) trained without sliding; (iii) pre-trained with RPVE; (iv) trained with sliding, then finetuned without (LF = Late Fusion, CC = ChannelCat). The dashed line relates the finetuned models to the same model trained without sliding.
  • Figure 4: Analysis of navigation behavior: Sankey plots show the distribution of success/failure codes over 994 test episodes for different models, and their "flow" between certain pairs of models. For instance, the strength of the connection between "Time out" (left) and "success" (right) indicates how many episodes toggled from one to the other when switching from the left to the right model.
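
The early-fusion input constructions named in the Figure 2 caption, ChannelCat (b) and SpaceToDepth (c), can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `channel_cat` and `space_to_depth`, the block size, the channel ordering inside each patch, and the order in which the two operations are applied are all assumptions made for illustration.

```python
import numpy as np


def channel_cat(obs: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Stack observation and goal over the channel axis (dim=1).

    obs, goal: arrays of shape (B, C, H, W). Returns (B, 2*C, H, W).
    """
    if obs.shape != goal.shape:
        raise ValueError("observation and goal must have the same shape")
    return np.concatenate([obs, goal], axis=1)


def space_to_depth(x: np.ndarray, block: int) -> np.ndarray:
    """Fold each non-overlapping block x block patch into the channel axis.

    x: (B, C, H, W) -> (B, C*block*block, H//block, W//block).
    Assumed ordering: out channel = c*block*block + i*block + j, where
    (i, j) is the pixel offset inside the patch.
    """
    b, c, h, w = x.shape
    if h % block or w % block:
        raise ValueError("H and W must be divisible by block")
    # Split each spatial axis into (patch index, offset inside patch).
    x = x.reshape(b, c, h // block, block, w // block, block)
    # Bring the two offset axes next to the channel axis.
    x = x.transpose(0, 1, 3, 5, 2, 4)
    return x.reshape(b, c * block * block, h // block, w // block)


# Hypothetical toy inputs: one RGB observation and one RGB goal, 8x8 pixels.
rng = np.random.default_rng(0)
obs = rng.random((1, 3, 8, 8))
goal = rng.random((1, 3, 8, 8))

stacked = channel_cat(obs, goal)            # shape (1, 6, 8, 8)
packed = space_to_depth(stacked, block=4)   # shape (1, 96, 2, 2)
print(stacked.shape, packed.shape)
```

After both steps, every spatial location of `packed` holds all pixels of a 4x4 patch from both the observation and the goal, so a convolution filter applied at that location sees the two images' patch contents side by side. This is the sense in which the caption suggests correspondence could emerge directly through conv filters.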