To Ask or Not to Ask? Detecting Absence of Information in Vision and Language Navigation

Savitha Sam Abraham, Sourav Garg, Feras Dayoub

TL;DR

This paper addresses how agents can recognize “when” they lack sufficient information, without focusing on “what” is missing, particularly in VLN tasks with vague instructions, and proposes an attention-based instruction-vagueness estimation module that learns associations between instructions and the agent's trajectory.

Abstract

Recent research in Vision Language Navigation (VLN) has overlooked the development of agents' inquisitive abilities, which allow them to ask clarifying questions when instructions are incomplete. This paper addresses how agents can recognize "when" they lack sufficient information, without focusing on "what" is missing, particularly in VLN tasks with vague instructions. Equipping agents with this ability enhances efficiency by reducing potential digressions and seeking timely assistance. The challenge in identifying such uncertain points is balancing between being overly cautious (high recall) and overly confident (high precision). We propose an attention-based instruction-vagueness estimation module that learns associations between instructions and the agent's trajectory. By leveraging instruction-to-path alignment information during training, the module's vagueness estimation performance improves by around 52% in terms of precision-recall balance. In our ablative experiments, we also demonstrate the effectiveness of incorporating this additional instruction-to-path attention network alongside the cross-modal attention networks within the navigator module. Our results show that the attention scores from the instruction-to-path attention network serve as better indicators for estimating vagueness.

Paper Structure

This paper contains 17 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Navigation agent paths using different instruction-vagueness estimation approaches: (1) Overly confident approach (red) that rarely seeks help, (2) Overly cautious approach (blue) that seeks help very often and (3) Our balanced approach (green) that seeks timely assistance. Dashed arrows indicate movements made with external assistance (best viewed in color).
  • Figure 2: Interaction between the navigator and the IV module: The IV module receives the encoded instruction and the path taken so far with a suggestion for the next move from the navigator. It predicts the certainty in the navigator's next move suggestion.
  • Figure 3: Pre-training: Architecture of the network that learns to identify the most relevant span or chunk in $\hat{I}$ that influenced the last move made, $N_{\text{t+1}}$.
  • Figure 4: A navigation example with the input $I_{short}$ showing the agent trajectories when supported by $f_{CP}$ (blue), $f_{Base}$ (red), and our approach $f_{IV(GP+pretrain)}$ (green). Dashed arrows indicate movements with oracle intervention (best viewed in color).
  • Figure 5: Number of oracle interventions vs. trajectories for $f_{\text{CP}}$, $f_{\text{Base}}$, $f_{\text{VDN}}$, and ours, $f_{\text{IV(GP+pretrain)}}$ (best viewed in color).