Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

Xuesong Zhang; Yunbo Xu; Jia Li; Ruonan Liu; Zhenzhen Hu

TL;DR

The paper addresses Vision-and-Language Navigation (VLN), focusing on the modality gap between language instructions and visual observations. It introduces SUSA (Semantic Understanding and Spatial Awareness), a hierarchical framework comprising Textual Semantic Understanding (TSU) and Depth Enhanced Spatial Perception (DSP) modules, fused through Hierarchical Aggregation and Prediction (HAP) to align semantic and spatial cues with natural language instructions. A contrastive learning objective, together with partial pretraining, supports instruction grounding and generalization across the discrete benchmarks (R2R, REVERIE, SOON) and the continuous R2R-CE setting. Ablations examine the separate contributions of textual panoramas, depth exploration maps, and cross-modal fusion, and qualitative analyses illustrate improved grounding and navigation decisions. Overall, SUSA shows that explicit, modality-specific representations, integrated hierarchically, improve VLN performance over the baseline and generalize better across domains.
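The summary above mentions a contrastive learning objective but does not give its form. As a rough illustration only, the sketch below assumes a generic symmetric InfoNCE loss between paired instruction and view embeddings; the function name `info_nce`, the arguments `text_feats` and `view_feats`, and the `temperature` value are hypothetical and are not taken from the paper.

```python
import numpy as np


def info_nce(text_feats, view_feats, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (text, view) embeddings.

    Assumption: row i of text_feats is the positive match for row i of
    view_feats, and every other row in the batch acts as a negative.
    This is a generic formulation, not the objective used in the paper.
    """
    # L2-normalise so the dot product becomes cosine similarity.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (batch, batch), scaled by temperature.
    logits = (t @ v.T) / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the text-to-view and view-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    print("aligned pairs:", info_nce(feats, feats))
    print("misaligned pairs:", info_nce(feats, np.roll(feats, 1, axis=0)))
```

Under this assumed formulation, correctly paired embeddings yield a lower loss than misaligned ones, which is the property that pushes instruction and environment representations toward a shared space.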

Abstract

Navigating unseen environments from natural language instructions remains challenging for egocentric agents in Vision-and-Language Navigation (VLN). Humans naturally ground concrete semantic knowledge within spatial layouts during indoor navigation. Although prior work has introduced diverse environment representations to improve reasoning, auxiliary modalities are often naively concatenated with RGB features, which underutilizes each modality's distinct contribution. We propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at multiple scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, capturing fine-grained semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth Enhanced Spatial Perception (DSP) module incrementally builds a trajectory-level depth exploration map, providing a coarse-grained representation of global spatial layout. Extensive experiments show that the hierarchical representation enrichment of SUSA significantly improves navigation performance over the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON) and generalizes better to the continuous R2R-CE benchmark.
