MLANet: Multi-Level Attention Network with Sub-instruction for Continuous Vision-and-Language Navigation

Zongtao He, Liuyi Wang, Shu Li, Qingqing Yan, Chengju Liu, Qijun Chen

TL;DR

The paper tackles continuous Vision-and-Language Navigation (VLN) by introducing MLANet, which integrates a Fast Sub-instruction Algorithm (FSA), a Multi-Level Attention (MLA) module, and a Peak Attention Loss (PAL) to improve real-time grounding of instructions in unseen environments. Key innovations include the annotation-free FSA, which produces the FSASub dataset; MLA, which fuses low- and high-level linguistic signals with vision; and PAL, which nudges attention toward a single sub-instruction to support long-horizon planning. Empirical results on the VLN-CE benchmark with FSASub show substantial improvements over baselines, including strong gains with auxiliary training and RL, and ablations validate each component. The work contributes modular, reusable components for instruction understanding and navigation, and it releases FSASub to support future research.

Abstract

Vision-and-Language Navigation (VLN) aims to develop intelligent agents that navigate unseen environments using only language and vision supervision. In the recently proposed continuous setting (continuous VLN), the agent must act in free 3D space and faces tougher challenges such as real-time execution, complex instruction understanding, and long action sequence prediction. For better performance in continuous VLN, we design a multi-level instruction understanding procedure and propose a novel model, the Multi-Level Attention Network (MLANet). The first step of MLANet is to generate sub-instructions efficiently. We design a Fast Sub-instruction Algorithm (FSA) to segment the raw instruction into sub-instructions and generate a new sub-instruction dataset named "FSASub". FSA is annotation-free and 70 times faster than the current method, thus meeting the real-time requirement of continuous VLN. To solve the complex instruction understanding problem, MLANet needs a global perception of the instruction and observations. We propose a Multi-Level Attention (MLA) module to fuse vision, low-level semantics, and high-level semantics, producing features that contain a dynamic and global comprehension of the task. MLA also mitigates the adverse effects of noise words, thus ensuring a robust understanding of the instruction. To correctly predict actions in long trajectories, MLANet needs to focus on which sub-instruction is being executed at every step. We propose a Peak Attention Loss (PAL) to improve the flexible and adaptive selection of the current sub-instruction. PAL benefits the navigation agent by concentrating its attention on local information, thus helping the agent predict the most appropriate actions. We train and test MLANet on the standard benchmark. Experimental results show that MLANet outperforms baselines by a significant margin.
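
The idea behind PAL, concentrating attention on one sub-instruction, can be illustrated with a small sketch. This is not the paper's formulation, which is not reproduced on this page. It assumes a hypothetical target distribution: a normalized Gaussian bump centred on the global maximum of the attention scores, with a mean-squared-error penalty against it. The names `peak_target`, `peak_attention_loss`, and the `width` parameter are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def peak_target(attn, width=1.0):
    """Assumed target: a single normalized bump centred on the global maximum."""
    peak = max(range(len(attn)), key=lambda i: attn[i])
    raw = [math.exp(-((i - peak) ** 2) / (2 * width ** 2)) for i in range(len(attn))]
    total = sum(raw)
    return [r / total for r in raw]


def peak_attention_loss(attn, width=1.0):
    """Mean squared error between the attention scores and the single-peak target."""
    target = peak_target(attn, width)
    return sum((a - t) ** 2 for a, t in zip(attn, target)) / len(attn)


# Attention over five sub-instructions: one case with two competing peaks,
# one case with a single dominant peak.
two_peaks = softmax([2.0, 0.1, 0.0, 1.8, 0.2])
one_peak = softmax([3.0, 1.0, 0.0, -1.0, -1.0])
print(peak_attention_loss(two_peaks), peak_attention_loss(one_peak))
```

Under these assumptions, the two-peak distribution incurs the larger penalty, so minimizing the loss pushes the scores toward a single peak.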

Paper Structure (27 sections, 12 equations, 8 figures, 5 tables, 1 algorithm)


Figures (8)

  • Figure 1: Improvement from leveraging sub-instructions. With only word attention (left), the agent misunderstands the task in the initial steps and ends up at the wrong position. Sub-instructions (right) strengthen useful information and weaken the impact of noise words ("room" in the third sentence is noise during the initial steps), so the agent chooses the correct direction and successfully navigates to the target position.
  • Figure 2: An illustration of the model architecture. The input contains an RGB image, a depth image, a raw instruction, and sub-instructions from FSA. After the encoders and GRU memory, instruction features are fused by MLA, and vision features are fused by spatial attention. The action decoder accepts all hidden features and outputs an action. The attention scores of the high-level attention are used for PAL. Gray shading and bold fonts mark important components.
  • Figure 3: An inside view of the MLA module. Two visual hidden states serve as attention queries, and multi-level instruction features serve as keys and values. After the attention blocks, the high-level and low-level outputs are fused by a fully-connected layer to produce a dynamic global perception of the instruction.
  • Figure 4: The role of PAL. When the actual score (orange line, triangle marks) has multiple peaks, PAL encourages the global maximum and weakens the other local maxima, bringing the actual score closer to the expected score (blue line, round marks).
  • Figure 5: The kernel density estimate (KDE) plot of sub-instruction numbers. A higher sub-instruction number means there are more sub-instructions in an instruction.
  • ...and 3 more figures
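
The data flow described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes single-head scaled dot-product attention, a `tanh` activation on the fully-connected fusion layer, and random placeholder weights and features. All function and variable names (`attend`, `multi_level_attention`, `w_fuse`, and so on) are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (assumed attention form)."""
    scores = softmax(keys @ query / np.sqrt(query.shape[0]))
    return scores @ values, scores


def multi_level_attention(h_low, h_high, word_feats, sub_feats, w_fuse, b_fuse):
    """Attend over words (low level) and sub-instructions (high level), then fuse."""
    low_out, _ = attend(h_low, word_feats, word_feats)
    high_out, high_scores = attend(h_high, sub_feats, sub_feats)
    # Fully-connected fusion of both levels; the tanh activation is an assumption.
    fused = np.tanh(w_fuse @ np.concatenate([low_out, high_out]) + b_fuse)
    return fused, high_scores


rng = np.random.default_rng(0)
d = 8                                    # feature size (placeholder)
word_feats = rng.normal(size=(12, d))    # 12 word features (low-level semantics)
sub_feats = rng.normal(size=(3, d))      # 3 sub-instruction features (high-level)
h_low, h_high = rng.normal(size=d), rng.normal(size=d)  # two visual hidden states
w_fuse, b_fuse = rng.normal(size=(d, 2 * d)), np.zeros(d)

fused, high_scores = multi_level_attention(
    h_low, h_high, word_feats, sub_feats, w_fuse, b_fuse
)
print(fused.shape, high_scores.shape)
```

The function also returns the high-level attention scores because, per the Figure 2 caption, those scores are the quantity that PAL operates on.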