ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments

Lu Yue, Dongliang Zhou, Liang Xie, Erwei Yin, Feitian Zhang

TL;DR

This work addresses perception challenges in Vision-and-Language Navigation in Continuous Environments (VLN-CE) by introducing ST-Booster, an iterative spatiotemporal booster. It combines Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG) to fuse global topological graphs and local BEV grids, align them with instructions, and generate Guided Attention Heatmaps for instruction-aware waypoint selection. Through weakly supervised GAH training, pretraining (MLM, HSAP, GAHP), and fine-tuning in continuous simulations, ST-Booster achieves state-of-the-art performance, especially in unseen and disturbance-prone environments. The approach advances VLN-CE by enabling robust, interpretable spatiotemporal perception and more reliable decision making in complex 3D environments.

Abstract

Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules: Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures short-term local details via grid maps. MGAF aligns these dual-map representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses demonstrate that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.
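
To make the dual-map idea concrete, the following is a minimal sketch of the two memory structures the abstract attributes to HSTE: a topological graph for long-term global memory and an agent-centred grid for short-term local detail. It is an illustration under stated assumptions, not the authors' implementation. The names `TopoGraph` and `local_grid`, the 2D positions, the grid size, and the cell length are all hypothetical, and the real system stores learned visual features rather than point counts.

```python
from dataclasses import dataclass, field
import math


@dataclass
class TopoGraph:
    """Hypothetical long-term memory: visited locations as nodes, moves as edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> (x, y) position
    edges: set = field(default_factory=set)    # undirected pairs of node ids

    def add_node(self, node_id, position, prev_id=None):
        # Record the new location and link it to the node the agent came from.
        self.nodes[node_id] = position
        if prev_id is not None:
            self.edges.add(frozenset((prev_id, node_id)))


def local_grid(agent_xy, points, size=5, cell=1.0):
    """Hypothetical short-term memory: count observed points per cell of an
    agent-centred size x size grid (cell edge length in metres, assumed)."""
    grid = [[0] * size for _ in range(size)]
    half = size * cell / 2.0
    for x, y in points:
        dx, dy = x - agent_xy[0], y - agent_xy[1]
        col = math.floor((dx + half) / cell)
        row = math.floor((dy + half) / cell)
        # Points outside the local window are dropped from the grid;
        # only the graph keeps a record of distant places.
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] += 1
    return grid


graph = TopoGraph()
graph.add_node(0, (0.0, 0.0))
graph.add_node(1, (2.0, 0.0), prev_id=0)
grid = local_grid((2.0, 0.0), [(2.0, 0.0), (3.2, 0.0), (10.0, 10.0)])
```

The split mirrors the trade-off named in the abstract: the graph grows with the trajectory and preserves global connectivity, while the grid is rebuilt around the current pose and keeps only nearby detail.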

Paper Structure

This paper contains 15 sections, 12 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Overview of environmental representation and perception in VLN-CE. (a) Comparison of observations between discrete environments (MP3D) and continuous environments (Habitat); (b) advantages and limitations of topological maps [chen2021topological] and grid maps [georgakis2022cross]; (c) workflow of BEVBert [an2023bevbert] using dual-map decision fusion; (d) workflow of ST-Booster with iterative dual-map feature fusion and multi-task decision integration.
  • Figure 2: The proposed ST-Booster comprises three core modules. First, the Hierarchical SpatioTemporal Encoding (HSTE) module captures global spatial structures via global topological graphs and extracts local temporal details using local grid-based maps. Next, the Multi-Granularity Aligned Fusion (MGAF) module fuses these heterogeneous map features and aligns the fused representations with linguistic embeddings. Finally, the integrated representations are used in the Value-Guided Waypoint Generation (VGWG) module to predict Guided Attention Heatmaps (GAHs), which adaptively adjust candidate waypoint distributions toward instruction-relevant regions.
  • Figure 3: Successful navigation examples under short and long instructions in unseen environments, along with visualizations of predicted and ground-truth GAHs along successful paths. The figure presents four representative cases. The top example illustrates navigation guided by a concise textual instruction (displayed above the navigation map), including the agent’s real-time panoramic observations and top-down views. The middle two examples depict navigation scenarios guided by longer instructions, while the bottom example shows the visualization of predicted and ground-truth GAHs along a successful trajectory.
  • Figure 4: Visualization of GAH generator predictions. (a) GAHs generated by BEVBert and ST-Booster on the val unseen split of VLN-CE. (b) The influence of GAHs on candidate waypoint distributions under correct and erroneous predictions. (c) Successful and failed examples generated with GAHs.
  • Figure 5: Comparison of navigation performance with respect to task complexity on the VLN-CE unseen split. ST-Booster consistently outperforms the selected SOTA methods across all complexity dimensions.
  • ...and 1 more figure
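
The captions for Figures 2 and 4 describe GAHs adjusting candidate waypoint distributions toward instruction-relevant regions. The sketch below shows one simple way such an adjustment could work: multiply a softmax distribution over candidates by the heatmap value sampled at each candidate, then renormalise. This combination rule is an assumption made for illustration, since the excerpt does not specify how VGWG combines the two. The function names `softmax` and `reweight_waypoints` and the `eps` smoothing term are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_waypoints(candidate_logits, heatmap_values, eps=1e-8):
    """Assumed rule: scale each candidate's probability by the heatmap value
    at its location, then renormalise so the result sums to 1."""
    prior = softmax(candidate_logits)
    # eps keeps the total positive even if every heatmap value is zero.
    weighted = [p * (h + eps) for p, h in zip(prior, heatmap_values)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Three equally scored candidates; the heatmap favours the second one.
probs = reweight_waypoints([0.0, 0.0, 0.0], [0.1, 0.8, 0.1])
```

With a uniform prior, the output simply follows the heatmap. With a non-uniform prior, a wrong heatmap can pull probability mass away from a good candidate, which is the failure mode Figure 4(b) refers to under erroneous predictions.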