Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation
Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, Shuqiang Jiang
TL;DR
This work tackles vision-and-language navigation in continuous 3D environments, where visual occlusions make it hard to predict the future environment. It introduces the Hierarchical Neural Radiance Representation (HNR), which predicts multi-level semantic features for future candidate locations using volume rendering and hierarchical encoding guided by CLIP embeddings, avoiding costly pixel-level image reconstruction. Coupled with a lookahead VLN model, the method builds a navigable future path tree and evaluates its branches in parallel with a cross-modal graph transformer, achieving state-of-the-art results on the R2R-CE and RxR-CE benchmarks. The approach makes navigation planning in unseen environments more robust and efficient.
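As a rough illustration of how volume rendering can produce a feature vector rather than a pixel color, the sketch below applies standard NeRF-style alpha compositing to per-sample features along one ray. It is a minimal sketch under stated assumptions, not the HNR implementation: the function name `composite_features` and its inputs (per-sample densities, segment lengths, and feature vectors) are hypothetical.

```python
import math


def composite_features(sigmas, deltas, feats):
    """Blend per-sample features along one ray (NeRF-style compositing).

    sigmas: density at each sample (assumed input, not from the paper)
    deltas: distance covered by each sample along the ray
    feats:  one feature vector (list of floats) per sample
    Returns (rendered_feature, weights).
    """
    dim = len(feats[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet blocked
    for sigma, delta, feat in zip(sigmas, deltas, feats):
        # Opacity of this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        # Light that survives this segment reaches the next sample.
        transmittance *= 1.0 - alpha
    return rendered, weights
```

Samples behind a dense surface receive almost no weight, which is how occlusion is accounted for in this kind of rendering: the blended feature is dominated by the first opaque sample along the ray.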
Abstract
Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following a natural language instruction. At each navigation step, the agent selects one of the possible candidate locations and then moves to it. For better navigation planning, the lookahead exploration strategy aims to evaluate the agent's next action by accurately anticipating the future environment at candidate locations. To this end, some existing works predict RGB images of future environments, but this strategy suffers from image distortion and high computational cost. To address these issues, we propose a pre-trained hierarchical neural radiance representation model (HNR) that produces multi-level semantic features for future environments, which are more robust and efficient than pixel-wise RGB reconstruction. Furthermore, with the predicted future environmental representations, our lookahead VLN model can construct a navigable future path tree and select the optimal path via efficient parallel evaluation. Extensive experiments on the VLN-CE datasets confirm the effectiveness of our method.
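The idea of a future path tree can be sketched as follows: enumerate the paths reachable within a few moves, score every path in a single pass, and take the first move of the best one. This is a toy sketch under stated assumptions, not the paper's model. The names `enumerate_paths` and `select_next_node` are hypothetical, and cosine similarity between an instruction vector and the feature at each path's endpoint stands in for the learned scorer.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def enumerate_paths(neighbors, start, depth):
    """Return every loop-free path of 1..depth moves that begins at `start`."""
    paths, frontier = [], [[start]]
    for _ in range(depth):
        extended = [
            path + [nxt]
            for path in frontier
            for nxt in neighbors.get(path[-1], [])
            if nxt not in path  # do not revisit a node on the same path
        ]
        paths.extend(extended)
        frontier = extended
    return paths


def select_next_node(neighbors, node_feats, instr_feat, start, depth=2):
    """Score all branches of the path tree, then return the best first move."""
    paths = enumerate_paths(neighbors, start, depth)
    if not paths:
        raise ValueError("no navigable candidate from the start node")
    # All branches are scored together; a real model would batch this step.
    scores = [cosine(node_feats[path[-1]], instr_feat) for path in paths]
    best = max(range(len(paths)), key=lambda i: scores[i])
    return paths[best][1], paths[best], scores
```

In this sketch, a branch whose first step looks unpromising can still win if it leads to a well-matching location one move later, which a purely one-step choice would miss.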
