ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments
Lu Yue, Dongliang Zhou, Liang Xie, Erwei Yin, Feitian Zhang
TL;DR
This work addresses perception challenges in Vision-and-Language Navigation in Continuous Environments (VLN-CE) by introducing ST-Booster, an iterative spatiotemporal booster. It combines Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG) to fuse global topological graphs and local BEV grids, align them with instructions, and generate Guided Attention Heatmaps for instruction-aware waypoint selection. Through weakly supervised GAH training, pretraining (MLM, HSAP, GAHP), and fine-tuning in continuous simulations, ST-Booster achieves state-of-the-art performance, especially in unseen and disturbance-prone environments. The approach advances VLN-CE by enabling robust, interpretable spatiotemporal perception and more reliable decision making in complex 3D environments.
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules: Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and Value-Guided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures short-term local details via grid maps. MGAF aligns these dual-map representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.
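To make the idea of heatmap-guided waypoint selection concrete, the following is a minimal illustrative sketch, not the paper's VGWG implementation. It assumes a GAH is available as a 2D grid of instruction-relevance scores and that candidate waypoints are given as in-bounds grid cells; the function name `select_waypoint`, the `radius` parameter, and the window-averaging rule are all hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple


def select_waypoint(
    heatmap: Sequence[Sequence[float]],
    candidates: Sequence[Tuple[int, int]],
    radius: int = 1,
) -> Tuple[int, float]:
    """Pick the candidate whose neighbourhood has the highest mean heatmap value.

    Hypothetical helper: the averaging window is an assumption, not the
    scoring rule used by ST-Booster. Candidates are assumed to lie inside
    the grid. Returns (-1, -inf) when no candidates are given.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    best_idx, best_score = -1, float("-inf")
    for idx, (r, c) in enumerate(candidates):
        # Collect heatmap values in a square window clipped to the grid.
        values = [
            heatmap[i][j]
            for i in range(max(0, r - radius), min(rows, r + radius + 1))
            for j in range(max(0, c - radius), min(cols, c + radius + 1))
        ]
        score = sum(values) / len(values)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Toy 4x4 heatmap with a relevance peak near cell (2, 2).
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.0],
    [0.0, 0.2, 0.9, 0.4],
    [0.0, 0.0, 0.4, 0.3],
]
candidates = [(0, 0), (2, 2), (3, 0)]
best_idx, best_score = select_waypoint(heatmap, candidates)
print(best_idx, round(best_score, 4))
```

In this toy setup the candidate at the relevance peak is chosen. Averaging over a small window rather than reading a single cell is one simple way to make the choice less sensitive to local noise in the grid; whether ST-Booster does anything similar is not stated in the abstract.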
