NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang, Jiapeng Xu, Yihan Wang, Ziyan Yu, Wenzhe Cai, Lei Kang, Hao Dong
TL;DR
NavSpace introduces the first spatial-intelligence benchmark for instruction-following navigation, filling a gap left by existing benchmarks, which focus mainly on semantic understanding. It defines a four-stage construction pipeline and six spatial task categories, and evaluates 22 agents, including multimodal large language models (MLLMs). The paper finds that current MLLMs struggle with embodied spatial tasks, while a specialized navigation large model, SNav, achieves strong performance and benefits from targeted spatial data generation. Real-world tests with a quadruped robot corroborate the NavSpace findings and establish SNav as a robust baseline for future research on spatially intelligent navigation. The work highlights the need for improved spatial perception and reliable perception-to-action translation in embodied agents.
Abstract
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding and do not systematically evaluate navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results reveal the current state of spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and in real-robot tests, establishing a strong baseline for future work.
