Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation

Huilin Tian; Jingke Meng; Wei-Shi Zheng; Yuan-Ming Li; Junkai Yan; Yunong Zhang

TL;DR

This work tackles outdoor Vision-and-Language Navigation by emphasizing spatial localization prior to planning. It introduces Loc4Plan, which combines a Block-Aware Spatial Locating (BAL) module to estimate block-level position and a Spatial-Aware Action Planning (SAP) module to ground instructions using hierarchical, spatially informed representations. The approach leverages a block concept, long-term turning angles, and a hierarchical semantic association to align spatial state with sentence- and token-level guidance, trained with losses L_AP, L_BAL, and L_HSA. Experiments on Touchdown and map2seq show state-of-the-art performance in seen and unseen scenarios, underscoring the practical value of incorporating spatial localization into outdoor VLN grounding and planning.

Abstract

Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction has been completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input, but neglect the crucial role of the agent's spatial position information in the grounding process. In this work, we first explore the substantial effect of spatial position locating on the grounding of outdoor VLN, drawing inspiration from human navigation. In real-world navigation scenarios, before planning a path to the destination, humans typically need to figure out their current location. This observation underscores the pivotal role of spatial localization in the navigation process. We introduce a novel framework, Locating before Planning (Loc4Plan), designed to incorporate spatial perception for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to perform spatial localization before planning a decision action based on the corresponding guidance. The framework comprises a block-aware spatial locating (BAL) module and a spatial-aware action planning (SAP) module. Specifically, to help the agent perceive its spatial location in the environment, the BAL module learns a position predictor that measures how far the agent is from the next intersection, which reflects its position. After the locating process, the SAP module incorporates this spatial information to ground the corresponding guidance and enhance the precision of action planning. Extensive experiments on the Touchdown and map2seq datasets show that the proposed Loc4Plan outperforms the SOTA methods.
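To make the locating idea concrete, the following minimal sketch shows one way a per-node "distance to the next intersection" signal could be derived from a trajectory. It is an illustration under stated assumptions, not the paper's definition: the function name `block_progress_targets`, the boolean intersection flags, and the normalization by block length are all hypothetical, and the actual score used by the BAL module is given by the paper's own equations.

```python
from typing import List


def block_progress_targets(is_intersection: List[bool]) -> List[float]:
    """Return, for each trajectory node, the fraction of its block traversed.

    Assumption (hypothetical, for illustration only): a block is a run of
    consecutive nodes that ends at an intersection node, or at the last
    node of the trajectory. A value of 1.0 means the agent has reached
    the end of the current block.
    """
    targets: List[float] = []
    start = 0  # index of the first node in the current block
    last = len(is_intersection) - 1
    for end, flag in enumerate(is_intersection):
        # Close the block at an intersection or at the end of the trajectory.
        if flag or end == last:
            length = end - start + 1
            targets.extend((i - start + 1) / length for i in range(start, end + 1))
            start = end + 1
    return targets


if __name__ == "__main__":
    # Six nodes; nodes 2 and 4 are intersections.
    flags = [False, False, True, False, True, False]
    print(block_progress_targets(flags))
```

Under these assumptions, a predictor trained to regress such values would output a number near 1.0 as the agent approaches an intersection, which is the kind of positional cue the abstract describes.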

Paper Structure (16 sections, 16 equations, 5 figures, 6 tables)

Introduction
Related Works
Vision-and-Language Navigation
Textual Grounding in VLN
Locating Before Planning
Preliminary
Model Overview
Block-Aware Spatial Locating
Spatial-Aware Action Planning
The Overall Training
Experiments
Experimental Setup
Comparisons with SOTA VLN Methods
Ablation Studies
Visualization and Qualitative Analysis
...and 1 more sections

Figures (5)

Figure 1: Illustration of the navigation process of our locating-before-planning approach. During the locating phase, the agent locates its relative spatial position in the current block. In the planning phase, the agent identifies the corresponding guidance to follow and decides which action to take (i.e., FORWARD).
Figure 2: The overall framework of our proposed Loc4Plan. The image and text encoders extract features of the visual observation and the instructions, respectively. First, the block-aware spatial locating (BAL) module leverages the visual representation and the spatial information of the trajectory (i.e., junction-type embedding, heading delta) to locate the agent's position relative to the current block. Then we identify the corresponding guidance that the agent needs to follow by associating the spatial-aware state representation with the provided instructions in a hierarchical manner, ranging from sentence-level to token-level granularity. Finally, the agent incorporates the spatial locating information into action decision planning.
Figure 3: Performance comparison with different instruction lengths and trajectory complexity.
Figure 4: Qualitative results of block process score (Eq. (\ref{eq:BAL_4})) prediction in the BAL module. The green polyline represents the ground-truth block process across the entire trajectory, while the purple polyline depicts the corresponding predictions made by our method. The red brackets divide the navigation into multiple stages, with each stage encompassing nodes that belong to a single block.
Figure 5: Visualization of the sentence relevance scores (Eq. (\ref{eq:HSA_2})) in the HSA module. ①-⑦ indicate the range of each sentence in the instruction. The heatmap shows the degree of attention of our model to each sentence at each step. The red arrows point out three key navigation steps, whose corresponding node positions in the scene graph are labeled with white boxes.
