PRET: Planning with Directed Fidelity Trajectory for Vision and Language Navigation

Renjie Lu, Jingke Meng, Wei-Shi Zheng

TL;DR

This work addresses vision-and-language navigation (VLN) by recasting planning as alignment between natural-language instructions and directed fidelity trajectories on a directed graph. PRET replaces expensive full-map planning with orientation-aware edge representations and two compact transformers, the matching assessment module (MAM) and the candidate comparison module (CCM), which score directed trajectories and select the next target, while a KV-cache reduces computation. Empirically, PRET outperforms the previous state-of-the-art method BEVBert on RxR and is comparable on R2R at substantially lower computational cost, and ablations confirm the importance of directionality and the modular planning components. The approach offers a scalable, efficient pathway to robust VLN that exploits directed topology and trajectory-level reasoning, with practical implications for real-time navigation in complex environments.

Abstract

Vision and language navigation is a task that requires an agent to navigate according to a natural language instruction. Recent methods predict sub-goals on a constructed topological map at each step to enable long-term action planning. However, they suffer from high computational cost when attempting to support such high-level predictions with GCN-like models. In this work, we propose an alternative method that facilitates navigation planning by considering the alignment between instructions and directed fidelity trajectories, each of which is a path from the initial node to a candidate location on a directed graph without detours. This planning strategy leads to an efficient model while achieving strong performance. Specifically, we introduce a directed graph to represent the explored area of the environment, emphasizing directionality. We then define the trajectory representation as a sequence of directed edge features, which are extracted from the panorama based on the corresponding orientation. Finally, we assess and compare the alignment between the instruction and different trajectories during navigation to determine the next navigation target. Our method outperforms the previous SOTA method BEVBert on the RxR dataset and is comparable on the R2R dataset while substantially reducing the computational cost. Code is available at https://github.com/iSEE-Laboratory/VLN-PRET.
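
To make the notion of a directed fidelity trajectory concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "without detours" can be read as a fewest-edge path on the directed graph, and the function name `fidelity_trajectories`, the node labels, and the toy edge list are all hypothetical. Each trajectory is returned as a sequence of directed edges, mirroring the edge-based trajectory representation described in the abstract.

```python
from collections import deque


def fidelity_trajectories(edges, start, candidates):
    """For each reachable candidate, return a fewest-edge directed path
    from `start` as a list of (source, target) edges.

    Assumption: "without detours" is interpreted as a fewest-edge path.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    # Breadth-first search records the predecessor of each reached node.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)

    paths = {}
    for candidate in candidates:
        if candidate not in parent:
            continue  # candidate is not reachable from the start node
        path = []
        node = candidate
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        paths[candidate] = path[::-1]  # reverse into start-to-candidate order
    return paths


# Hypothetical explored area: A, B, C are visited; D and E are unvisited
# candidates that have only been observed, so they have incoming edges only.
edges = [
    ("A", "B"), ("B", "A"),
    ("B", "C"), ("C", "B"),
    ("A", "C"), ("C", "A"),
    ("C", "D"),
    ("B", "E"),
]
paths = fidelity_trajectories(edges, "A", ["D", "E"])
```

In this toy graph the trajectory to `D` is the edge sequence `A→C, C→D` rather than the longer `A→B, B→C, C→D`. In the method described above, each such edge would carry an orientation-aware panorama feature, and the resulting sequences would be compared against the instruction.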

Paper Structure (29 sections, 8 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison of SPL and GFLOPs on the R2R test unseen split. Our method is comparable with previous SOTA methods while being more computationally efficient. The computational cost of the text encoder and visual encoder is omitted for fair comparison.
  • Figure 2: Illustration of our approach. (a) shows our directed graph representation. Each edge is assigned an orientation-aware panorama feature. (b) depicts our planning method. We select an unvisited node (colored yellow) to navigate towards next by choosing the fidelity path (colored red) that best aligns with the instruction.
  • Figure 3: Illustration of our model. (a) is the overall framework of our method. At each step, we update the graph, extract path embeddings, and predict actions. (b) depicts the matching assessment module (MAM). Each token is an edge feature. We compute path embeddings for each newly observed node with a cross-modal transformer and impose a causal mask to reduce computational cost. (c) shows the candidate comparison module (CCM). We gather the path embeddings of unvisited nodes and forward them into a single-layer transformer followed by an MLP to predict the temporary target.
  • Figure 4: Comparison of orientation panoramic view and single candidate view.
  • Figure 5: (a) Visualization of the agent's navigation process, showcasing its ability to learn a backtracking strategy. (b) Visualizing attention weights in OPE to illustrate the distinction between undirected and directed trajectory representations.
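
The Figure 3 caption notes that a causal mask reduces computational cost. One plausible reading, consistent with the KV-cache mentioned in the TL;DR, is that under a causal mask the states computed for a path prefix do not change when the path is extended, so they can be reused. The sketch below illustrates only that property. It is a simplified single-head self-attention with identity projections, it omits the cross-modal attention to the instruction, and the function name and the toy edge features are hypothetical.

```python
import math


def causal_self_attention(tokens):
    """Single-head self-attention with a causal mask.

    Assumption: identity query/key/value projections, chosen only to keep
    the sketch short. Position i attends to positions 0..i.
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]  # the causal mask hides later positions
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, visible)) / total
            for d in range(dim)
        ])
    return outputs


# Hypothetical edge features: the path to node C extends the path to node B.
path_to_b = [[1.0, 0.0], [0.5, 0.5]]
path_to_c = path_to_b + [[0.0, 1.0]]

out_b = causal_self_attention(path_to_b)
out_c = causal_self_attention(path_to_c)

# The outputs at the shared prefix positions are unaffected by the new edge.
prefix_unchanged = all(
    math.isclose(x, y)
    for row_b, row_c in zip(out_b, out_c)
    for x, y in zip(row_b, row_c)
)
```

Because the prefix outputs are unchanged, only the newly appended edge needs a fresh attention computation when a trajectory grows by one step, which is the saving a key-value cache exploits.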