Vision-and-Dialog Navigation
Jesse Thomason, Michael Murray, Maya Cakmak, Luke Zettlemoyer
TL;DR
This paper introduces CVDN (Cooperative Vision-and-Dialog Navigation), a dataset of over 2k human-human dialogs in simulated, photorealistic home environments, in which two humans cooperate to locate a goal starting from an ambiguous hint. The paper defines the Navigation from Dialog History task and establishes a multimodal sequence-to-sequence baseline that encodes the dialog history to predict navigation actions. It shows that encoding a longer dialog history improves performance and that mixed supervision from human and planner paths yields the best results. The findings indicate that dialog and navigation history help ground instructions in the visual context, including in unseen environments. This work provides a foundation for future end-to-end, two-agent systems that jointly navigate and reason via dialog, with potential transfer to real-world robotic assistants.
Abstract
Robots navigating in human environments should use language to ask for assistance and be able to understand human responses. To study this challenge, we introduce Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments. The Navigator asks questions of their partner, the Oracle, who has privileged access to the best next steps the Navigator should take according to a shortest path planner. To train agents that search an environment for a goal location, we define the Navigation from Dialog History task. An agent, given a target object and a dialog history between humans cooperating to find that object, must infer navigation actions towards the goal in unexplored environments. We establish an initial, multimodal sequence-to-sequence model and demonstrate that looking farther back in the dialog history improves performance. Source code and a live interface demo can be found at https://cvdn.dev/.
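As an illustration of the task input described above, the following minimal Python sketch flattens a target object and a variable amount of dialog history into a single text sequence that a sequence-to-sequence encoder could consume. It is not the released implementation: the `NDHInstance` container, the `build_input` helper, the `max_turns` parameter, and the `<TAR>`, `<NAV>`, and `<ORA>` marker strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NDHInstance:
    """Hypothetical container for one Navigation from Dialog History example."""
    target_object: str
    # (Navigator question, Oracle answer) pairs, oldest exchange first.
    dialog: List[Tuple[str, str]]


def build_input(instance: NDHInstance, max_turns: Optional[int] = None) -> str:
    """Flatten the target and the most recent `max_turns` exchanges into one string.

    max_turns=None keeps the full history; max_turns=0 keeps only the target.
    """
    if max_turns is None:
        turns = instance.dialog
    elif max_turns == 0:
        # Handled separately because list[-0:] would return the whole list.
        turns = []
    else:
        turns = instance.dialog[-max_turns:]

    parts = [f"<TAR> {instance.target_object}"]
    for question, answer in turns:
        parts.append(f"<NAV> {question}")
        parts.append(f"<ORA> {answer}")
    return " ".join(parts)


# Made-up example dialog, for illustration only.
example = NDHInstance(
    target_object="plant",
    dialog=[
        ("Should I go left?", "Yes, go left past the sofa."),
        ("Is it upstairs?", "No, stay on this floor."),
    ],
)

target_only = build_input(example, max_turns=0)
last_exchange = build_input(example, max_turns=1)
full_history = build_input(example)
```

Varying `max_turns` from 0 up to the full history gives a simple way to compare how far back in the dialog an agent looks, which is the kind of comparison the abstract refers to.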
