Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, Caiming Xiong

TL;DR

The paper tackles Vision-and-Language Navigation by equipping an agent with two components: visual-textual co-grounding, which identifies the part of the instruction already completed and the part needed next, and a progress monitor, which explicitly estimates how far the agent has advanced toward the goal. This self-monitoring design is integrated into a seq2seq framework with panoramic vision, using a joint training objective and progress-aware beam search during inference. On the Room-to-Room benchmark the agent reaches state-of-the-art performance, including an 8% absolute increase in success rate on the unseen test set, and ablations confirm that both co-grounding and progress estimation contribute. The work advances interpretable, goal-directed navigation and suggests that self-monitoring may apply more broadly to complex instruction-following tasks.

Abstract

The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which instruction is needed next, which way to go, and its navigation progress towards the goal. In this paper, we introduce a self-monitoring agent with two complementary components: (1) a visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor to ensure the grounded instruction correctly reflects the navigation progress. We test our self-monitoring agent on a standard benchmark and analyze our proposed approach through a series of ablation studies that elucidate the contributions of the primary components. Using our proposed method, we set the new state of the art by a significant margin (8% absolute increase in success rate on the unseen test set). Code is available at https://github.com/chihyaoma/selfmonitoring-agent .
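
As a rough illustration of the two components named in the abstract, the sketch below uses dot-product soft attention over instruction tokens as a stand-in for textual grounding, a squashed scalar as a stand-in for the progress estimate, and a weighted sum of an action loss and a progress loss as one possible joint objective. Every name, shape, and parameter here (`ground_instruction`, `estimate_progress`, `joint_loss`, `w_out`, `lam`) is an assumption made for illustration. It is not the authors' implementation, which is available in the repository linked above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground_instruction(state, token_feats):
    """Hypothetical textual grounding: attend over instruction tokens.

    state:       agent state vector (list of floats)
    token_feats: one feature vector per instruction token
    Returns the attention weights and the attention-weighted feature.
    """
    scores = [sum(s * f for s, f in zip(state, feat)) for feat in token_feats]
    weights = softmax(scores)
    dim = len(token_feats[0])
    grounded = [
        sum(w * feat[d] for w, feat in zip(weights, token_feats))
        for d in range(dim)
    ]
    return weights, grounded


def estimate_progress(weights, state, w_out):
    """Hypothetical progress monitor: a scalar in (-1, 1).

    Conditions on where attention currently sits in the instruction
    (weights) together with the agent state. w_out is an assumed
    weight vector of length len(weights) + len(state).
    """
    features = list(weights) + list(state)
    return math.tanh(sum(w * x for w, x in zip(w_out, features)))


def joint_loss(action_probs, target_action, progress_pred, progress_target, lam=0.5):
    """Assumed joint objective: weighted action loss plus progress loss."""
    action_loss = -math.log(action_probs[target_action])    # cross-entropy
    progress_loss = (progress_pred - progress_target) ** 2  # squared error
    return lam * action_loss + (1.0 - lam) * progress_loss
```

In this sketch the progress estimate takes the attention weights as input, so an error in the progress term also penalizes attention that sits on the wrong part of the instruction. That mirrors the abstract's description of a monitor that keeps the grounded instruction consistent with navigation progress.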

Paper Structure

This paper contains 11 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Vision-and-Language Navigation task and our proposed self-monitoring agent. The agent is constantly aware of what was completed, what is next, and where to go, as it navigates through unknown environments by following navigational instructions.
  • Figure 2: Proposed self-monitoring agent consisting of visual-textual co-grounding, progress monitoring, and action selection modules. Textual grounding: identify which part of the instruction has been completed or is ongoing and which part is potentially needed for the next action. Visual grounding: summarize the observed surrounding images. Progress monitor: regularize the grounding and ensure the grounded instruction reflects progress towards the goal. Action selection: identify which direction to go.
  • Figure 3: The positions and weights of grounded instructions as agents navigate by following instructions. For our self-monitoring agent with the progress monitor, the grounded instruction used for action selection shifts gradually from the beginning of the instruction towards the end. This is not true of the baseline method.
  • Figure 4: The self-monitoring agent navigates successfully in two unseen environments. The agent correctly follows the grounded instruction and reaches the goal. The percentage of instruction completeness estimated by the proposed progress monitor gradually increases as the agent navigates and approaches the goal. Finally, the agent grounds the word "Stop" and stops (see the supplementary material for full figures).
  • Figure 5: The self-monitoring agent navigates successfully in two different unseen environments. Given the navigational instruction shown at the top of the figure, the agent starts from the starting position and follows the instruction towards the goal. The percentage of instruction completeness estimated by the proposed progress monitor gradually increases as the agent navigates and approaches the goal.
  • ...and 2 more figures