Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, Shuqiang Jiang

TL;DR

This work tackles vision-and-language navigation in continuous 3D environments, where visual occlusions make it hard to predict the future environment. It introduces the Hierarchical Neural Radiance Representation (HNR), which predicts multi-level semantic features for future candidate locations using volume rendering and hierarchical encoding guided by CLIP embeddings, avoiding costly pixel-level image reconstruction. Coupled with a lookahead VLN model, the method builds a navigable future path tree and evaluates its branches in parallel via a cross-modal graph transformer, achieving state-of-the-art results on the R2R-CE and RxR-CE benchmarks. The approach improves the robustness and efficiency of navigation planning in unseen environments.

Abstract

Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following a natural language instruction. At each navigation step, the agent selects one of the possible candidate locations and then moves to it. For better navigation planning, the lookahead exploration strategy aims to evaluate the agent's next action by accurately anticipating the future environment of candidate locations. To this end, some existing works predict RGB images of future environments, but this strategy suffers from image distortion and high computational cost. To address these issues, we propose the pre-trained hierarchical neural radiance representation model (HNR) to produce multi-level semantic features for future environments, which are more robust and efficient than pixel-wise RGB reconstruction. Furthermore, with the predicted future environmental representations, our lookahead VLN model is able to construct the navigable future path tree and select the optimal path via efficient parallel evaluation. Extensive experiments on the VLN-CE datasets confirm the effectiveness of our method.

Paper Structure

This paper contains 33 sections, 15 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Illustration of different methods to represent the navigable candidate locations. (a) uses the single-view observation (yellow sector area). (b) uses the panorama of the candidate location (blue circular area) to anticipate the future environment.
  • Figure 2: The framework of the hierarchical neural radiance representation model (HNR). The HNR model encodes the observed environments (yellow area) into the feature cloud. Through aggregating k-nearest features, the MLP network predicts the latent vector and volume density of sampled points along the rendered ray. A region-level representation is encoded by compositing these latent vectors via volume rendering, then a view encoder is used to encode all region-level representations within a future view (red area) and obtain an entire future view representation. All future views of the candidate location can be combined as a panorama (blue area) to support navigation.
  • Figure 3: Illustration of the volume rendering method and hierarchical encoding.
  • Figure 4: The framework of the lookahead VLN model. In addition to the stop embedding (black), three types of nodes are used to structure the topological map: visited nodes (yellow), candidate nodes (blue) and lookahead nodes (red).
  • Figure 5: Average cosine similarity between predicted future views and ground truth at different distances between candidate locations and agent, on the val unseen split of the R2R-CE dataset.
  • ...and 7 more figures
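
The caption of Figure 2 describes compositing per-sample latent vectors along a ray via volume rendering to obtain a region-level representation. As a rough illustration only, the sketch below applies the standard NeRF-style quadrature weights to latent vectors instead of colors. The function name `composite_latents`, its arguments, and the exact weighting are assumptions made for this example; the paper's actual formulation, sampling scheme, and MLP are not reproduced here.

```python
import numpy as np

def composite_latents(latents, densities, deltas):
    """Composite per-sample latent vectors along one ray (hypothetical helper).

    latents:   (N, D) latent vectors for N points sampled along the ray
    densities: (N,)   non-negative volume densities at those points
    deltas:    (N,)   distances between adjacent samples
    Returns the (D,) composited vector and the (N,) per-sample weights.
    """
    latents = np.asarray(latents, dtype=float)
    densities = np.asarray(densities, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Opacity of each ray segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance before sample i: T_i = prod_{j < i} (1 - alpha_j)
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    # Contribution of each sample to the final representation
    weights = trans * alpha
    return weights @ latents, weights


if __name__ == "__main__":
    # Toy input: 3 samples with 2-dimensional latent vectors (made-up values).
    lat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    region, w = composite_latents(lat, densities=[0.5, 2.0, 0.1], deltas=[1.0, 1.0, 1.0])
    print("weights:", w)
    print("region-level vector:", region)
```

Under this weighting, samples behind a dense (occluding) point receive little weight, so the composited vector is dominated by the first surfaces the ray reaches. Empty space along the ray, where density is zero, contributes nothing.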