SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation

Jiahang Liu; Tianyu Xu; Jiawei Chen; Lu Yue; Jiazhao Zhang; Zhiyong Wang; Minghan Li; Qisheng Zhao; Anqi Li; Qi Su; Zhizheng Zhang; He Wang

TL;DR

SPAN-Nav is an end-to-end foundation model that infuses embodied navigation with universal 3D spatial awareness using RGB video streams. It introduces a compact representation for spatial priors, enabling robust spatial awareness to generalize even to tasks that lack explicit spatial supervision.

Abstract

Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav uses this single spatial token to explicitly inject spatial cues into action reasoning within an end-to-end framework. Through multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness to generalize even to tasks that lack explicit spatial supervision. To support comprehensive spatial learning, we present a large-scale dataset of 4.2 million occupancy annotations that covers both indoor and outdoor scenes across multiple types of navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach across complex physical scenarios.
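As a rough illustration of the single-token idea, the sketch below quantizes a pooled occupancy feature into one discrete token by nearest-neighbor lookup in a codebook, in the spirit of the VQ-VAE-based initialization mentioned in the Figure 2 caption. The function name `quantize_to_token`, the codebook values, and the 2-D feature size are hypothetical and chosen only for illustration; the actual encoder, codebook, and token format used by SPAN-Nav are not specified on this page.

```python
import math


def quantize_to_token(feature, codebook):
    """Return the index of the codebook entry closest to `feature`.

    Uses squared L2 distance. This is a generic vector-quantization
    lookup, assumed here for illustration; it is not the paper's code.
    """
    best_index, best_dist = -1, math.inf
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Hypothetical 3-entry codebook over 2-D features (illustrative values only).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A pooled scene feature collapses to a single discrete spatial token.
token = quantize_to_token([0.9, 0.2], CODEBOOK)
```

In this toy setup the whole scene is summarized by one integer index, which is the kind of coarse-grained cue the abstract argues is sufficient for navigation.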

Paper Structure (30 sections, 20 equations, 11 figures, 4 tables)

Introduction
Related Works
Overview
Model of SPAN-Nav
Observation Encoding
Cross-Scene Spatial Token Learning
Spatial Chain-of-Thought Action Reasoning
Training Strategy
Dataset of SPAN-Nav
Overview
Navigation Data Collection for Multi Task
Occupancy Annotation
Experiments
Experiment Setup
Benchmark Results
...and 15 more sections

Figures (11)

Figure 1: SPAN-Nav empowers end-to-end embodied navigation with generalized 3D perception. By introducing a Spatially-aware Chain-of-Thought (CoT) mechanism, SPAN-Nav achieves precise and safe navigation in complex environments.
Figure 2: Overview of SPAN-Nav. (a) Encoder-Decoder (ED) Initialization. We pre-train a VQ-VAE-based architecture on cross-scene occupancy datasets. This initializes the perceptual representation capabilities of the encoder and decoder, facilitating subsequent end-to-end training. (b) SPAN-Nav Pipeline. SPAN-Nav leverages a VLM-based architecture to learn 3D spatial awareness through a designed spatial token, which is explicitly injected into the navigation reasoning process, facilitating efficient and spatially-aware decision-making. (c) Adaptive Spatial Token Acquisition. The source of the spatial token adapts to the training phase. In training stage I, tokens are derived from Ground-Truth (GT) occupancy. In training stage II and during inference, the model switches to relying on its self-predicted spatial tokens.
Figure 3: Training stages for navigation-specific tasks. In Stage I, SPAN-Nav is trained via teacher-forcing using ground-truth (GT) occupancy, jointly supervised by occupancy, spatial representation, and action. Stage II transitions to student-forcing, inferring actions from self-predicted spatial tokens. Notably, $\mathcal{L}_{\text{occ}}$ only applies when GT occupancy is available in this stage. (Legends follow the main overview figure.)
Figure 4: Dataset composition and occupancy-map visualization. Sunburst summary of our data composition. The inner ring groups samples by task, the middle ring by dataset, and the outer ring by occupancy-map (Occ) source / availability. MU, SK, and FB2K denote MetaUrban, SeKai, and FrodoBots-2K, respectively.
Figure 5: Experimental visualization of SPAN-Nav across diverse tasks in different simulators. This includes PointGoal Navigation, VLN, and UrbanNav, encompassing a variety of complex indoor and outdoor scenarios. Across these tasks, SPAN-Nav consistently plans safe and robust trajectories, ensuring high navigation success rates.
...and 6 more figures
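The phase-dependent behavior described in the Figure 2(c) and Figure 3 captions can be sketched as two small selection rules. This is a minimal sketch under stated assumptions: the phase names (`"stage1"`, `"stage2"`, `"inference"`) and both function names are hypothetical, and the captions do not give loss weights or implementation details, so none are modeled here.

```python
def select_spatial_token(phase, predicted_token, gt_token=None):
    """Pick the spatial token fed to action reasoning for a given phase.

    Stage I uses the token derived from GT occupancy (teacher-forcing);
    Stage II and inference use the model's self-predicted token.
    """
    if phase == "stage1":
        if gt_token is None:
            raise ValueError("Stage I expects a token derived from GT occupancy")
        return gt_token
    if phase in ("stage2", "inference"):
        return predicted_token
    raise ValueError(f"unknown phase: {phase}")


def occupancy_loss_applies(phase, has_gt_occupancy):
    """Whether the occupancy loss L_occ contributes for this sample."""
    if phase == "inference":
        return False  # no training loss at inference time
    # During training, L_occ needs GT occupancy; in Stage II this is
    # only true for samples that come with occupancy annotations.
    return bool(has_gt_occupancy)
```

For example, a Stage II sample without occupancy annotations would still be supervised on actions, but `occupancy_loss_applies("stage2", False)` returns `False`, so no occupancy term is added for it.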
