NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, Jonas Frey

TL;DR

The paper evaluates the embodied navigation capabilities of vision-language models (VLMs) by introducing NaviTrace, a real-world, VQA-style benchmark in which models output 2D image-space traces conditioned on an instruction and an embodiment. It also introduces a semantic-aware trace score that combines Dynamic Time Warping (DTW) distance, goal endpoint error (FDE), and embodiment-specific penalties derived from automatic semantic maps, and shows that this score aligns more closely with human judgments than DTW alone. The dataset comprises 1,000 diverse scenes with over 3,000 expert traces across four embodiments, enabling scalable, reproducible evaluation and a public leaderboard. The key finding is a substantial gap between current VLMs and human performance, with goal localization and spatial grounding as the primary bottlenecks; the proposed score offers a practical, interpretable metric for guiding future improvements in embodied navigation.

Abstract

Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
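
To make the three ingredients of the score concrete, the following is a minimal sketch, not the paper's implementation. It computes a DTW distance between a predicted and a reference trace, the endpoint error between their final points, and a mean per-pixel penalty looked up in a cost mask. The function names, the additive combination, and the weights `w_fde` and `w_pen` are assumptions chosen for illustration; the exact formula and normalization used by NaviTrace are not given in this overview.

```python
import math


def dtw_distance(pred, ref):
    """Classic dynamic-programming DTW between two lists of (x, y) points."""
    n, m = len(pred), len(ref)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])  # Euclidean pixel distance
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def goal_endpoint_error(pred, ref):
    """Distance between the last predicted point and the last reference point."""
    return math.dist(pred[-1], ref[-1])


def semantic_penalty(pred, cost_mask):
    """Mean cost of the mask cells the trace points fall on.

    cost_mask is a hypothetical row-major grid (cost_mask[row][col]) holding an
    embodiment-specific cost per pixel; points are clamped to the image bounds.
    """
    h, w = len(cost_mask), len(cost_mask[0])
    total = 0.0
    for x, y in pred:
        col = min(max(int(round(x)), 0), w - 1)
        row = min(max(int(round(y)), 0), h - 1)
        total += cost_mask[row][col]
    return total / len(pred)


def trace_cost(pred, ref, cost_mask, w_fde=1.0, w_pen=1.0):
    """Assumed additive combination of the three terms (lower is better)."""
    return (
        dtw_distance(pred, ref)
        + w_fde * goal_endpoint_error(pred, ref)
        + w_pen * semantic_penalty(pred, cost_mask)
    )


# Toy example: a 3-point prediction against a 2-point reference on a 1x3 mask.
pred = [(0, 0), (1, 0), (2, 0)]
ref = [(0, 0), (2, 0)]
mask = [[0.0, 0.0, 1.0]]  # only the right-most pixel is penalized
cost = trace_cost(pred, ref, mask)
```

In this sketch a lower cost means a better trace, whereas the leaderboard reports a score where higher is better, so any real use would need the paper's own mapping from cost to score.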

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We introduce NaviTrace, a novel VQA benchmark for VLMs that evaluates models on their embodiment-specific understanding of navigation across challenging real-world scenarios.
  • Figure 2: Left: Geographic distribution of image sources, with the inner circle denoting countries and the outer circle specifying cities or regions. Images originating from the GrandTour dataset are explicitly marked in the outer circle. Right: Distribution of scenarios by setting (urban vs. rural), environment type (natural vs. structured), lighting, and weather.
  • Figure 3: Left: Comparison between penalty cost masks based on Mask2Former and on manual segmentation. These masks are used to penalize traces that cross unsafe or irrelevant areas. Right: The score function aligns with human preference, shown by the correlation between the score-based ranking and a pairwise ranking created by a human.
  • Figure 4: Left: Ranking of VLMs, the uninformed baseline Straight Forward, and human expert performance split into each embodiment. Note that a higher score is better. Right: Performance per task category for the same models.
  • Figure 5: Example predictions by the models Gemini 2.5 Pro, GPT-5, Qwen 3 VL, and o3.
  • ...and 1 more figures