NaviTrace: Evaluating Embodied Navigation of Vision-Language Models
Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, Jonas Frey
TL;DR
NaviTrace is a real-world, VQA-style benchmark for evaluating the embodied navigation capabilities of vision-language models (VLMs): given an instruction and an embodiment, a model outputs a 2D trace in image space. The paper also introduces a semantic-aware trace score that combines Dynamic Time Warping (DTW) distance, goal endpoint error (FDE), and embodiment-specific penalties derived from automatically generated semantic maps, and shows that this score aligns more closely with human judgments than DTW alone. The dataset comprises 1,000 diverse scenes with over 3,000 expert traces across four embodiments, enabling scalable, reproducible evaluation and a public leaderboard. The evaluation reveals a substantial gap between current VLMs and human performance, with goal localization and spatial grounding as the primary bottlenecks; the proposed score offers a practical, interpretable metric for guiding future improvements in embodied navigation.
Abstract
Vision-language models (VLMs) demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1,000 scenarios and more than 3,000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
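To make the three ingredients of the trace score concrete, the following is a minimal sketch of how a DTW distance, a goal endpoint error, and a per-pixel semantic penalty could be computed for a predicted trace and combined. It is not the paper's scoring formula: the abstract does not give the weighting or normalization, so the plain weighted sum, the weights `w_fde` and `w_pen`, the function names, and the `penalty_map` layout (`penalty_map[row][col]`, already conditioned on one embodiment) are all assumptions made for illustration. In this sketch, lower values are better.

```python
import math


def dtw_distance(trace_a, trace_b):
    """Classic O(n*m) Dynamic Time Warping with Euclidean point cost."""
    n, m = len(trace_a), len(trace_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(trace_a[i - 1], trace_b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def endpoint_error(pred, ref):
    """Euclidean distance between the last points of the two traces."""
    return math.dist(pred[-1], ref[-1])


def semantic_penalty(pred, penalty_map):
    """Mean penalty of the pixels visited by the trace.

    penalty_map is a hypothetical 2D grid indexed as [row][col], assumed to be
    derived from a semantic map for a single embodiment.
    """
    rows, cols = len(penalty_map), len(penalty_map[0])
    total = 0.0
    for x, y in pred:
        # Clamp to the image bounds so off-image points still get a value.
        row = min(max(int(round(y)), 0), rows - 1)
        col = min(max(int(round(x)), 0), cols - 1)
        total += penalty_map[row][col]
    return total / len(pred)


def toy_trace_score(pred, ref, penalty_map, w_fde=1.0, w_pen=1.0):
    """Illustrative combination only; weights and sum form are assumptions."""
    return (
        dtw_distance(pred, ref)
        + w_fde * endpoint_error(pred, ref)
        + w_pen * semantic_penalty(pred, penalty_map)
    )


if __name__ == "__main__":
    reference = [(0, 0), (1, 0), (2, 0)]
    prediction = [(0, 0), (1, 0)]   # stops one pixel short of the goal
    penalties = [[0, 0, 1]]         # 1x3 toy image; last pixel is penalized
    print(toy_trace_score(prediction, reference, penalties))
```

In the toy usage, the prediction follows the reference but stops one pixel early, so both the DTW term and the endpoint term contribute while the semantic term stays at zero. A real evaluation would presumably normalize these terms and use a separate penalty map per embodiment, which this sketch leaves out.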
