Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

Hongyuan Mei, Mohit Bansal, Matthew R. Walter

TL;DR

The paper addresses the translation of natural-language navigational instructions into executable action sequences, using an end-to-end neural sequence-to-sequence model conditioned on the local world state. It introduces a bidirectional LSTM encoder, a multi-level aligner that draws on both low-level word representations and high-level sentence representations, and an LSTM decoder that generates actions. The approach achieves state-of-the-art results on a single-sentence navigation benchmark and competitive performance on the multi-sentence task with limited training data, without relying on linguistic resources such as parsers or seed lexicons. These results indicate that language can be grounded in action for navigational tasks without task-specific resources, which is relevant to autonomous agents operating in unfamiliar environments.

Abstract

We propose a neural sequence-to-sequence model for direction following, a task that is essential to realizing effective autonomous agents. Our alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) translates natural language instructions to action sequences based upon a representation of the observable world state. We introduce a multi-level aligner that empowers our model to focus on sentence "regions" salient to the current world state by using multiple abstractions of the input sentence. In contrast to existing methods, our model uses no specialized linguistic resources (e.g., parsers) or task-specific annotations (e.g., seed lexicons). It is therefore generalizable, yet still achieves the best results reported to-date on a benchmark single-sentence dataset and competitive results for the limited-training multi-sentence setting. We analyze our model through a series of ablations that elucidate the contributions of the primary components of our model.
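The multi-level aligner described in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the additive (tanh) scoring form, the function name `multi_level_align`, the parameter names `W`, `U`, `V`, `v`, and all dimensions are hypothetical choices that this text does not confirm. The one idea taken from the abstract is that attention draws on more than one abstraction of the sentence, here the raw word vectors `x` and the encoder hidden states `h`.

```python
import numpy as np

def multi_level_align(x, h, s_prev, W, U, V, v):
    """Attend over a sentence using two abstraction levels (illustrative only).

    x:      (T, dx) low-level word vectors
    h:      (T, dh) high-level encoder states (e.g. from a bidirectional LSTM)
    s_prev: (ds,)   previous decoder state
    W: (da, ds), U: (da, dx), V: (da, dh), v: (da,) are hypothetical parameters.
    """
    # Assumed additive score per word j: v . tanh(W s_prev + U x_j + V h_j)
    scores = np.tanh(s_prev @ W.T + x @ U.T + h @ V.T) @ v  # shape (T,)
    # Numerically stable softmax over the T word positions
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Context vector: weighted sum of the concatenated [x_j; h_j] pairs
    context = alpha @ np.concatenate([x, h], axis=1)  # shape (dx + dh,)
    return alpha, context

# Usage with random placeholder data (all sizes are arbitrary)
rng = np.random.default_rng(0)
T, dx, dh, ds, da = 5, 3, 4, 6, 8
x = rng.normal(size=(T, dx))
h = rng.normal(size=(T, dh))
s_prev = rng.normal(size=ds)
W = rng.normal(size=(da, ds))
U = rng.normal(size=(da, dx))
V = rng.normal(size=(da, dh))
v = rng.normal(size=da)

alpha, context = multi_level_align(x, h, s_prev, W, U, V, v)
```

Because the context vector concatenates `x` and `h`, the decoder in this sketch sees both the word-level and the sentence-level view of whichever words receive the most weight.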

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: An example of a route instruction-path pair in one of the virtual worlds from ? (?), with colors that indicate floor patterns and wall paintings, and letters that indicate different objects. Our method successfully infers the correct path for this instruction.
  • Figure 2: Our encoder-aligner-decoder model with multi-level alignment.
  • Figure 3: Long short-term memory (LSTM) unit.
  • Figure 4: Visualization of the alignment between words and actions in a map for a multi-sentence instruction.