Vision-and-Dialog Navigation

Jesse Thomason, Michael Murray, Maya Cakmak, Luke Zettlemoyer

TL;DR

The paper introduces CVDN, a large-scale, photorealistic Vision-and-Dialog Navigation dataset in which two humans cooperatively locate a goal using an ambiguous hint and dialog. The paper defines the Navigation from Dialog History task and establishes a multimodal sequence-to-sequence baseline that encodes the full dialog history to predict navigation actions, showing that longer dialog history improves performance and that mixed supervision from humans and planners yields the best results. Key findings indicate that dialog and navigation history are crucial for grounding instructions in dynamic visual contexts, with significant gains in unseen environments. This work provides a foundation for future end-to-end, two-agent systems that jointly navigate and reason via dialog, with potential transfer to real-world robotic assistants.

Abstract

Robots navigating in human environments should use language to ask for assistance and be able to understand human responses. To study this challenge, we introduce Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments. The Navigator asks questions to their partner, the Oracle, who has privileged access to the best next steps the Navigator should take according to a shortest path planner. To train agents that search an environment for a goal location, we define the Navigation from Dialog History task. An agent, given a target object and a dialog history between humans cooperating to find that object, must infer navigation actions towards the goal in unexplored environments. We establish an initial, multi-modal sequence-to-sequence model and demonstrate that looking farther back in the dialog history improves performance. Source code and a live interface demo can be found at https://cvdn.dev/

Paper Structure

This paper contains 30 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: In Cooperative Vision-and-Dialog Navigation, two humans are given a hint about an object $t_o$ in the goal room. The Navigator moves ($N$) through the simulated environment to find the goal room, and can stop at any time to type a question ($Q$) to the Oracle. The Oracle has a privileged view of the best next steps ($O$) according to a shortest path planner, and uses that information to answer ($A$) the question. The dialog continues until the Navigator stops in the goal room.
  • Figure 2: The distributions of steps taken by human Navigators versus a shortest path planner (Left), the number of word tokens from the Navigator and the Oracle (Center), and the number of utterances in dialogs (Right) across the CVDN dataset.
  • Figure 3: We use a sequence-to-sequence model with an LSTM encoder that takes in learnable token embeddings (LE) of the dialog history. The encoder conditions an LSTM decoder for predicting navigation actions that takes in fixed ResNet embeddings of visual environment frames. Here, we demarcate subsequences in the input (e.g., $t_o$) compared during input ablations.
  • Figure 4: The distribution of the 81 target objects $t_o$ in dialogs across CVDN.
  • Figure 5: Left: The IoU of nodes in the paths of human Navigator and shortest path planner trajectories in CVDN versus those in R2R when comparing paths in the same scan. Right: The IoU of Navigator and shortest path planner trajectories in the same scan versus the IoU of player and shortest path planner trajectories across a dialog.
  • ...and 1 more figure
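
The Figure 5 caption compares trajectories by the IoU (intersection over union) of the nodes they visit. As a rough illustration only, the sketch below computes that quantity for two paths, assuming each path is a list of node identifiers and that a path is treated as the set of nodes it touches. The function name `path_iou` and the node labels are hypothetical; the paper's exact computation is not given in this summary.

```python
def path_iou(path_a, path_b):
    """Intersection over union of the node sets of two paths.

    Assumption: visit order and repeated visits are ignored, so each
    path is reduced to the set of node identifiers it contains.
    """
    nodes_a, nodes_b = set(path_a), set(path_b)
    union = nodes_a | nodes_b
    if not union:
        # Two empty paths share nothing; report 0.0 rather than dividing by zero.
        return 0.0
    return len(nodes_a & nodes_b) / len(union)


# Hypothetical node identifiers for a human Navigator path and a planner path.
human_path = ["n1", "n2", "n3", "n4"]
planner_path = ["n1", "n2", "n5"]

# Shared nodes: {n1, n2}; all nodes: {n1, n2, n3, n4, n5}.
print(path_iou(human_path, planner_path))  # 0.4
```

An IoU of 1.0 means the two paths visit exactly the same nodes, and 0.0 means they share none, so lower values indicate that human Navigators wandered farther from the planner's route.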