Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

Xuesong Zhang, Yunbo Xu, Jia Li, Ruonan Liu, Zhenzhen Hu

TL;DR

The paper addresses Vision-and-Language Navigation (VLN) by tackling modality heterogeneity between language and vision. It introduces SUSA, a hierarchical framework comprising Textual-Aware Semantic Understanding (TSU) and Depth-Enhanced Spatial Perception (DSP), fused through Hierarchical Aggregation and Prediction (HAP) to align rich semantic and spatial cues with natural language instructions. A contrastive learning objective, along with partial pretraining, enables robust instruction-grounding and generalization across discrete benchmarks (R2R, REVERIE, SOON) and continuous settings (R2R-CE). Extensive ablations validate the distinct benefits of textual panoramas, depth exploration maps, and cross-modal fusion, while qualitative analyses illustrate improved grounding and navigation decisions. Overall, SUSA demonstrates that explicit, modality-specific representations and their hierarchical integration yield substantial gains in VLN performance and cross-domain generalization.
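The summary mentions a contrastive learning objective without giving its form. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE loss over paired embeddings. It assumes the pairs are instruction embeddings and textual view-description embeddings. The function name `info_nce`, the temperature value, and the pairing itself are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def info_nce(text_emb, view_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical helper: row i of text_emb is assumed to match row i of view_emb.
    """
    # L2-normalise so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = view_emb / np.linalg.norm(view_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # shape (B, B); diagonal holds matched pairs

    def cross_entropy_diag(z):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-view and view-to-text directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    print("aligned pairs:   ", info_nce(x, x))
    print("mismatched pairs:", info_nce(x, x[::-1]))
```

Correctly paired embeddings should yield a lower loss than mismatched ones, which is the property such an objective uses to pull matching instruction and environment representations together.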

Abstract

Navigating unseen environments from natural language instructions remains challenging for egocentric agents in Vision-and-Language Navigation (VLN). Humans naturally ground concrete semantic knowledge within spatial layouts during indoor navigation. Although prior work has introduced diverse environment representations to improve reasoning, auxiliary modalities are often naively concatenated with RGB features, which underutilizes each modality's distinct contribution. We propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at multiple scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, capturing fine-grained semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth Enhanced Spatial Perception (DSP) module incrementally builds a trajectory-level depth exploration map, providing a coarse-grained representation of global spatial layout. Extensive experiments show that the hierarchical representation enrichment of SUSA significantly improves navigation performance over the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON) and generalizes better to the continuous R2R-CE benchmark.

Paper Structure

This paper contains 25 sections, 10 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: An overview of the proposed SUSA. Beyond RGB inputs, we introduce view-level textual panoramas and trajectory-level depth exploration maps, supporting the agent’s explicit understanding of the environment.
  • Figure 2: The detailed architecture of our proposed SUSA model. The orange and green arrows highlight our proposed Textual-Aware Semantic Understanding (TSU) and Depth-Aware Spatial Perception (DSP) modules, respectively. The Hierarchical Aggregation and Prediction (HAP) module is designed for hierarchically aligning environmental representations with instructions.
  • Figure 3: Illustration of the TSU module in Fig. 2, which matches instructions and semantic features using static and dynamic matching strategies.
  • Figure 4: The pipeline for fusing hybrid environmental representations in the proposed HAP module in Fig. 2.
  • Figure 5: (a) and (b) show the performance gap ($\downarrow$; lower values indicate better performance) between seen and unseen environments, while (c) and (d) present key metrics ($\uparrow$) for different pretraining strategies on R2R and REVERIE.
  • ...and 7 more figures