Talk the Walk: Navigating New York City through Grounded Dialogue

Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, Douwe Kiela

TL;DR

Talk The Walk introduces a large-scale, grounded dialogue dataset in which a guide and a tourist coordinate to navigate to a target location using perception, action, and natural language. The authors propose Masked Attention for Spatial Convolutions (MASC) to ground tourist observations and actions in a 2D overhead map, enabling strong localization under both emergent and natural-language communication. In extensive experiments, MASC yields significant gains over baselines: emergent-language localization reaches near or above human performance under certain perception assumptions, while natural-language grounding exposes both the challenges and the potential gains of working with generated utterances. The work provides baseline performance for the full task, analyzes the role of actions, perception, and trajectory length, and contributes a benchmark and an architectural tool for grounded language learning in embodied navigation.

Abstract

We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a "guide" and a "tourist") that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open problem that we pose to the community. We (i) focus on the task of tourist localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide's map, (ii) show it yields significant improvements for both emergent and natural language communication, and (iii) using this method, we establish non-trivial baselines on the full task.
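As a rough illustration of the idea behind MASC, the sketch below applies a softmax-normalized 3x3 mask to a 2D map with a single convolution-like step. When the mask concentrates on one neighbouring cell, the step shifts the map contents by one position, which is how an action can act as a state transition on the guide's map. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `softmax` and `masc_step` are hypothetical, the mask is fed in as raw logits rather than predicted from a message, and the learned convolution weights that the mask would modulate are omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def masc_step(grid, mask_logits):
    """Apply one masked 3x3 spatial step to a 2D map (hypothetical helper).

    grid:        (H, W) array, e.g. a belief or single-channel feature map.
    mask_logits: 9 values, normalized with a softmax and reshaped to 3x3.
                 In a full model these would be predicted from an action or
                 message embedding (an assumption in this sketch).
    """
    mask = softmax(np.asarray(mask_logits, dtype=float)).reshape(3, 3)
    padded = np.pad(grid, 1)  # zero padding keeps the output at (H, W)
    h, w = grid.shape
    out = np.zeros((h, w), dtype=float)
    # Cross-correlation: out[i, j] = sum_k mask[di, dj] * grid[i+di-1, j+dj-1]
    for di in range(3):
        for dj in range(3):
            out += mask[di, dj] * padded[di:di + h, dj:dj + w]
    return out


# A mask peaked on the left neighbour moves content one column to the right.
grid = np.zeros((4, 4))
grid[2, 2] = 1.0
logits = np.full(9, -1e9)
logits[3] = 0.0  # flat index 3 is row 1, column 0 of the 3x3 mask
shifted = masc_step(grid, logits)
```

A uniform mask (all logits equal) instead spreads the content evenly over the 3x3 neighbourhood, which corresponds to having no information about the action taken.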

Paper Structure

This paper contains 62 sections, 8 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Example of the Talk The Walk task: two agents, a "tourist" and a "guide", interact with each other via natural language in order to have the tourist navigate towards the correct location. The guide has access to a map and knows the target location but not the tourist location, while the tourist does not have a map and is tasked with navigating a 360-degree street view environment.
  • Figure 2: We show MASC values of two action sequences for tourist localization via discrete communication with $T=3$ actions. In general, we observe that the first action always corresponds to the correct state-transition, whereas the second and third are sometimes mixed. For instance, in the top example, the first two actions are correctly predicted but the third action is not (as the MASC corresponds to a "no action"). In the bottom example, the second action appears as the third MASC.
  • Figure 3: Result of running the text recognizer of Gupta et al. (2016) on four examples from the Hell's Kitchen neighborhood. Top row: two positive examples. Bottom row: an example of a false negative (left) and of many false positives (right).
  • Figure 4: Frequency of landmark classes
  • Figure 5: Map of New York City with red rectangles indicating the captured neighborhoods of the Talk The Walk dataset.
  • ...and 3 more figures