ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Dong An; Hanqing Wang; Wenguan Wang; Zun Wang; Yan Huang; Keji He; Liang Wang


TL;DR

This work tackles vision-language navigation in continuous environments (VLN-CE) by introducing ETPNav, a hierarchical system that builds an online topological map from predicted waypoints, uses a transformer-based cross-modal planner to generate long-range navigation plans, and executes plans with an obstacle-avoiding rotate-then-forward controller. A depth-only waypoint predictor, graph-aware self-attention, and a trial-and-error Tryout mechanism address long-horizon planning and obstacle deadlocks in realistic settings. Pre-training on proxy tasks and fine-tuning with student-forcing yield strong generalization, achieving state-of-the-art results on R2R-CE and RxR-CE and winning the RxR-Habitat Challenge. The approach provides a scalable baseline for robust long-range VLN-CE with practical obstacle handling in embodied AI.

Abstract

Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It has become increasingly important in embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. This lets the agent break the navigation procedure down into high-level planning and low-level control. ETPNav then uses a transformer-based cross-modal planner to generate navigation plans from the topological map and the instruction. The plan is executed by an obstacle-avoiding controller that uses a trial-and-error heuristic to keep the agent from getting stuck on obstacles. Experimental results demonstrate the effectiveness of the proposed method: ETPNav yields more than 10% and 20% improvements over the prior state of the art on the R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
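To make the low-level control idea concrete, the following is a minimal sketch of a rotate-then-forward decomposition combined with a trial-and-error fallback. It is an illustration under stated assumptions, not the released controller: the action names, the 15-degree turn, the 0.25 m step, the heading offsets, and the `try_forward` callback are all hypothetical, since this summary does not specify them.

```python
import math

def rotate_then_forward(pose, target, step=0.25, turn=math.radians(15)):
    """Turn a move toward `target` into discrete low-level actions.

    pose   -- (x, y, heading) with heading in radians (assumed convention)
    target -- (x, y) position of the next waypoint
    step, turn -- assumed forward distance and turn angle per action
    """
    x, y, heading = pose
    dx, dy = target[0] - x, target[1] - y
    desired = math.atan2(dy, dx)
    # Wrap the heading error into [-pi, pi) so the agent takes the short way round.
    delta = (desired - heading + math.pi) % (2 * math.pi) - math.pi
    n_turns = round(delta / turn)
    n_forward = int(math.hypot(dx, dy) // step)
    turn_action = "LEFT" if n_turns > 0 else "RIGHT"
    # Rotate first, then move forward.
    return [turn_action] * abs(n_turns) + ["FORWARD"] * n_forward

def tryout(try_forward, heading, offsets):
    """Trial-and-error fallback for a blocked forward step.

    try_forward(h) is a hypothetical callback that attempts one forward
    step at heading h and returns True if the agent's position changed.
    Returns the first heading that worked, or None if every trial failed.
    """
    if try_forward(heading):
        return heading
    for offset in offsets:
        candidate = heading + offset
        if try_forward(candidate):
            return candidate
    return None
```

For example, `rotate_then_forward((0.0, 0.0, 0.0), (0.0, 1.0))` produces a run of left turns followed by forward steps, and `tryout` is only consulted when a forward step leaves the agent where it was.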


Paper Structure (26 sections, 3 equations, 6 figures, 11 tables)


Introduction
Related Work
Vision-Language Navigation
VLN in Continuous Environments
Maps for Navigation
Method
Topological Mapping
Cross-Modal Planning
Text Encoder
Cross-Modal Graph Encoder
Control
Training and Inference
Experiment
Experimental Setup
Datasets
...and 11 more sections

Figures (6)

Figure 1: Overview of the proposed model, ETPNav. It consists of three modules, a topological mapping module that gradually updates the topological map as it receives new observations, a cross-modal planning module that computes a navigational plan based on the instruction and map, and a control module that executes the plan with low-level actions.
Figure 2: Illustration of the topological mapping module. It takes the previous graph ($G_{t-1}$) and the agent observation ($O_t$) as input. The waypoint prediction submodule first predicts several nearby waypoints. The graph update submodule organizes these waypoints and incorporates them to update the graph using a waypoint localization function ($\mathcal{F}_L$).
Figure 3: The planning module consists of a text encoder for instruction encoding, and a graph encoder to conduct cross-modal reasoning over the map to generate a path plan.
Figure 4: The effect of the agent's chassis radius on SR.
Figure 5: Comparison of the same episode's trajectories predicted by different model variants. (Top) The trajectory predicted by ETPNav using local planning. (Bottom) The trajectory predicted by ETPNav using global planning.
...and 1 more figure
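The graph update described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: it assumes that the waypoint localization function matches a predicted waypoint to the nearest existing node within a distance threshold, that matched waypoints are merged by averaging positions, and that positions are 2D. The `TopoMap` class, its method names, and the 0.5 m threshold are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class TopoMap:
    nodes: dict = field(default_factory=dict)   # node id -> (x, y) position
    counts: dict = field(default_factory=dict)  # node id -> waypoints merged so far
    edges: set = field(default_factory=set)     # undirected edges as sorted id pairs

    def add_node(self, pos):
        """Create a new node at `pos` and return its id."""
        node_id = len(self.nodes)
        self.nodes[node_id] = pos
        self.counts[node_id] = 1
        return node_id

    def localize(self, pos, threshold):
        """Assumed stand-in for F_L: nearest node within `threshold`, else None."""
        best_id, best_dist = None, threshold
        for node_id, node_pos in self.nodes.items():
            d = math.dist(pos, node_pos)
            if d <= best_dist:
                best_id, best_dist = node_id, d
        return best_id

    def add_waypoint(self, current_id, pos, threshold=0.5):
        """Merge a predicted waypoint into the map or add it as a new node."""
        node_id = self.localize(pos, threshold)
        if node_id is None:
            node_id = self.add_node(pos)
        else:
            # Running average of all waypoints merged into this node.
            n = self.counts[node_id]
            x, y = self.nodes[node_id]
            self.nodes[node_id] = ((x * n + pos[0]) / (n + 1),
                                   (y * n + pos[1]) / (n + 1))
            self.counts[node_id] = n + 1
        if node_id != current_id:
            self.edges.add(tuple(sorted((current_id, node_id))))
        return node_id
```

In this sketch, two waypoints predicted close to each other from different steps collapse into one node, which keeps the map compact as the agent moves along its path.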
