Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

Maeva Guerrier, Karthik Soma, Jana Pavlasek, Giovanni Beltrame

Abstract

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion- and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between perceptually similar locations, even when some semantic differences are present, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
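The robustness protocol applies controlled perturbations (motion blur, sun flare) to the input images. The snippet below is a minimal sketch of how such perturbations can be generated, assuming the albumentations library; the parameter values and file names are illustrative and do not reflect the paper's actual configuration.

```python
# Minimal perturbation sketch (assumption: albumentations; parameters are illustrative).
import albumentations as A
import cv2

# Hypothetical perturbation set: motion blur and a synthetic sun flare.
perturbations = {
    "motion_blur": A.MotionBlur(blur_limit=(7, 15), p=1.0),
    "sun_flare": A.RandomSunFlare(flare_roi=(0.0, 0.0, 1.0, 0.5), src_radius=120, p=1.0),
}

def perturb(image_bgr, kind):
    """Apply one named perturbation to an observation image (H x W x 3, uint8)."""
    return perturbations[kind](image=image_bgr)["image"]

if __name__ == "__main__":
    img = cv2.imread("observation.png")  # hypothetical input frame
    cv2.imwrite("observation_blur.png", perturb(img, "motion_blur"))
    cv2.imwrite("observation_flare.png", perturb(img, "sun_flare"))
```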

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: A Visual Navigation Model (VNM) takes as input a sequence of $k$ image observations $O_t$ and a goal image $o_g$ at timestep $t$. An image encoder and a distance encoder encode the observation sequence and the current ($o_t$) and goal observations, respectively. The backbone and decoder vary by method. The model outputs action $\hat{A}_t$, which can take the form of a single next waypoint or a trajectory, and can also include the temporal distance $d$.
  • Figure 2: Real-world evaluation environments (indoor and outdoor). Blue panels correspond to rover deployments, while yellow panels correspond to quadruped deployments.
  • Figure 3: Image quality metrics (LPIPS, PSNR, DSSIM) for goal-predicted cases in the Arena (top) and Snow (bottom) environments (see the metric-computation sketch after this list).
  • Figure 4: Corridor deployment results. Green (left): goal reached without collision in all trials. Red (right): at least one collision occurred. See Table \ref{tab:collision} for details. (Note: overlapping collisions at the same location are shown only once for clarity).
  • Figure 5: All Stairs trajectories for GNM, ViNT, NoMaD and NaviBridger (see Table \ref{tab:generalization_metrics_all_env}) with the reference trajectory.
  • ...and 3 more figures
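Figure 3 reports LPIPS, PSNR, and DSSIM between the goal image and the image captured where the model predicted the goal was reached. Below is a minimal sketch of how these three scores could be computed; it assumes the lpips and scikit-image packages and the common DSSIM = (1 - SSIM) / 2 convention, and is not the paper's released evaluation code.

```python
# Minimal sketch of the image-quality metrics in Figure 3 (LPIPS, PSNR, DSSIM).
# Assumptions: the `lpips` and `scikit-image` packages; uint8 RGB images.
import lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def to_lpips_tensor(img_uint8):
    """HxWx3 uint8 -> 1x3xHxW float tensor scaled to [-1, 1], as LPIPS expects."""
    t = torch.from_numpy(img_uint8).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def goal_recognition_scores(goal_img, reached_img):
    """Compare the goal image with the image at the goal-predicted pose."""
    lpips_score = lpips_fn(to_lpips_tensor(goal_img), to_lpips_tensor(reached_img)).item()
    psnr = peak_signal_noise_ratio(goal_img, reached_img, data_range=255)
    ssim = structural_similarity(goal_img, reached_img, channel_axis=-1, data_range=255)
    dssim = (1.0 - ssim) / 2.0  # structural dissimilarity
    return {"LPIPS": lpips_score, "PSNR": psnr, "DSSIM": dssim}
```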