NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions

Haolin Yang, Yuxing Long, Zhuoyuan Yu, Zihan Yang, Minghan Wang, Jiapeng Xu, Yihan Wang, Ziyan Yu, Wenzhe Cai, Lei Kang, Hao Dong

TL;DR

NavSpace introduces the first spatial-intelligence benchmark for instruction-following navigation, addressing the gap where existing benchmarks focus on semantic understanding. It defines a four-stage construction pipeline and six spatial categories, and evaluates 22 agents including multimodal large models. The paper reveals that current MLLMs struggle with embodied spatial tasks, while a specialized navigation large model, SNav, achieves strong performance and benefits from targeted spatial data generation. Real-world tests with a quadruped robot corroborate NavSpace findings and establish SNav as a robust baseline for future spatially intelligent navigation research. The work highlights the need for improved spatial perception and reliable perception-to-action translation in embodied agents.

Abstract

Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.

Paper Structure

This paper contains 16 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: (Left) Everyday navigation instructions that require spatial intelligence. To execute these instructions, a navigation agent must perceive and reason about spatial layout, scale, agent–object relative orientations, and environmental state. As the first benchmark to evaluate navigation agents' spatial intelligence, NavSpace collects navigation instructions covering six types of spatial-intelligence capabilities. (Right) Evaluation results on NavSpace for navigation agents driven by multimodal large models and navigation models. We further propose the SNav model to serve as a strong baseline.
  • Figure 2: Construction pipeline of NavSpace. (1) Questionnaire Survey: identify which forms of navigation instruction best reflect spatial intelligence. (2) Trajectory Collection: teleoperate agents in a simulated environment to record trajectories. (3) Instruction Annotation: use large-model-assisted analysis to create navigation instructions requiring spatial intelligence. (4) Human Cross-Validation: manually review and validate the annotated instructions to ensure correctness and executability.
  • Figure 3: Instruction Categories in NavSpace. These six categories were determined based on the questionnaire survey results. Every navigation trajectory and instruction was collected manually from HM3D scene datasets through our designed platform.
  • Figure 4: Visualization of NavSpace Statistics.
  • Figure 5: Framework of the SNav model. (Left) We propose a set of pipelines to create four types of spatially intelligent navigation instructions from existing scene data and instruction navigation data. (Right) With these generated data, we further fine-tune an end-to-end navigation foundation model to obtain SNav, a navigation large model with enhanced spatial intelligence.
  • ...and 2 more figures