P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation

Tianfu Li; Wenbo Chen; Haoxuan Xu; Xinhu Zheng; Haoang Li

Abstract

In Vision-and-Language Navigation (VLN), an agent must plan a path to the target specified by a language instruction, using its visual observations. Prevailing VLN methods therefore focus primarily on building powerful planners through visual-textual alignment. However, these approaches often skip comprehensive scene understanding before planning, leaving the agent with insufficient perception and prediction capabilities. We therefore propose P$^{3}$Nav, a novel end-to-end framework integrating perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. It then predicts waypoints to model the agent's potential future states, giving the agent intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing the reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the VLN task. Extensive experiments demonstrate that P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.

Paper Structure (21 sections, 22 equations, 12 figures, 7 tables)

Introduction
Related Work
Methodology
Perception: Objects and Map Semantics
Prediction: Waypoints and Future Scene
Planning
Experiments
Experimental Setup
Comparison with State-of-the-Art
Analysis of Intermediate Modules
Case Studies
Ablation Studies
Conclusion
Problem Formulation
Evaluation Datasets and Metrics
...and 6 more sections

Figures (12)

Figure 1: Motivation of our method. (a) Scene information, such as objects, maps, and waypoints, is crucial for grounding instructions to visual observations. (b.1) Early models have a "planning-only" structure, with limited scene understanding due to implicit feature extraction and aggregation. (b.2) Recent methods build external perception/prediction modules, but suffer from information loss and error accumulation. (b.3) Our method unifies perception, prediction, and planning in a single network, through a unified environment representation with end-to-end feature propagation between each stage.
Figure 2: Pipeline of our P$^{3}$Nav model. The agent first encodes its discrete observations into a BEV representation, upon which two perception decoders output object and map features, in parallel. Subsequently, two predictors sequentially decode waypoint cues and future scene semantics. Ultimately, a planning decoder incorporates the features from all the preceding modules to produce a comprehensive navigation decision.
Figure 3: Ground truth map semantics generation. First, we crop nearby regions from the point clouds with semantic annotations in the Matterport3D (MP3D) dataset, and project them onto a plane. Second, we generate a template-based description of the map. Then we prompt the VLM to refine the description. The last token from the last VLM decoder layer is defined as the ground truth map semantics.
Figure 4: Pipeline of waypoint-level prediction. (a) The waypoint feature is decoded via the multi-attention deformable transformer decoder; after that, a heatmap is generated through up-sampling. (b) NMS and depth-based post-processing are performed to identify candidate waypoints, addressing the discrete-to-continuous transfer problem.
Figure 5: Analysis of (a) object-level perception (mAP and mAR at IoU=0.5) and (b) waypoint-level prediction. BTG denotes BridgeGap, with BTG-B and BTG-U indicating its baseline and U-Net (Ronneberger et al., 2015) variants, respectively.
...and 7 more figures
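
The Figure 4 caption mentions that candidate waypoints are identified from a heatmap via NMS. As a rough illustration of that idea only, the sketch below runs a generic local-maximum suppression over a small 2D heatmap in plain Python. The function name `heatmap_nms`, the square neighbourhood, and the `radius`, `threshold`, and `top_k` parameters are all assumptions made for this example; the paper's actual NMS variant and its depth-based post-processing are not specified in this excerpt and are not reproduced here.

```python
def heatmap_nms(heatmap, radius=1, threshold=0.5, top_k=5):
    """Pick local maxima of a 2D heatmap (list of lists of floats).

    Hypothetical helper: a cell is kept when its score clears `threshold`
    and no cell within `radius` (square window) has a higher score.
    Returns up to `top_k` (row, col, score) tuples, best score first.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Neighbourhood around (r, c), clipped to the map borders.
            window = [
                heatmap[rr][cc]
                for rr in range(max(0, r - radius), min(h, r + radius + 1))
                for cc in range(max(0, c - radius), min(w, c + radius + 1))
            ]
            if score >= max(window):
                peaks.append((r, c, score))
    peaks.sort(key=lambda p: -p[2])
    return peaks[:top_k]


# Toy 5x5 heatmap with two clear peaks, one suppressed neighbour,
# and one low-score cell that falls below the threshold.
toy = [[0.0] * 5 for _ in range(5)]
toy[1][1] = 0.9   # peak
toy[1][2] = 0.8   # suppressed by the adjacent 0.9
toy[3][3] = 0.7   # peak
toy[4][0] = 0.3   # below threshold
candidates = heatmap_nms(toy)
print(candidates)  # [(1, 1, 0.9), (3, 3, 0.7)]
```

In this sketch the surviving cells are still grid indices; mapping them to continuous positions, which the caption attributes to the depth-based post-processing, would be a separate step.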
