Learning to Navigate in Cities Without a Map

Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell

TL;DR

The paper tackles city-scale visual navigation without maps by introducing StreetLearn, a Street View–derived RL environment. It presents a dual-pathway, goal-conditioned architecture (locale-specific LSTMs plus a general policy) and landmark-based goal representations to enable transfer across cities. Through curriculum learning and transfer experiments, the authors show robust navigation in multiple cities and demonstrate that pre-training on several regions improves adaptation to new ones. The work provides a realistic benchmark and a scalable, modular framework for end-to-end navigation in real-world environments, with resources released for wider use.

Abstract

Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support continuous self-localisation ("I am here") and a representation of the goal ("I am going there"). Building upon recent research that applies deep reinforcement learning to maze navigation problems, we present an end-to-end deep reinforcement learning approach that can be applied on a city scale. Recognising that successful navigation relies on integration of general policies with locale-specific knowledge, we propose a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities. We present an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage, and demonstrate that our learning method allows agents to learn to navigate multiple cities and to traverse to target destinations that may be kilometres away. The project webpage http://streetlearn.cc contains a video summarising our research and showing the trained agent in diverse city environments and on the transfer task, the form to request the StreetLearn dataset and links to further resources. The StreetLearn environment code is available at https://github.com/deepmind/streetlearn

Paper Structure

This paper contains 23 sections, 1 equation, 10 figures.

Figures (10)

  • Figure 1: (a) Our environment is built of real-world places from Street View (we illustrate Times Square and Central Park in New York City and St. Paul's Cathedral in London). The green cone represents the agent's location and orientation. (b) We use large regions of London and Paris and in New York we focus on 5 different regions to show transfer.
  • Figure 2: (a) In the illustration of the goal description, we show a set of 5 nearby landmarks and 4 distant ones; the code $g_i$ is a vector with a softmax-normalised distance to each landmark. (b) Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading ($\theta$). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.
  • Figure 3: Average per-episode rewards (y axis) are plotted vs. learning steps (x axis) for the courier task. We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment (a), and the CityNav agent in London (b). We also give Oracle performance and a Heuristic agent. A curriculum is used in London; we indicate the end of phase 1 (up to 500m) and the end of phase 2 (5000m). (c) Results of the CityNav agent on NYU, comparing radii of early rewards (ER) vs. ER with random coins vs. curriculum with ER 200m and no coins.
  • Figure 4: (a) Number of steps required for the CityNav agent to reach a goal from 100 start locations vs. the straight-line distance to the goal in metres. (b) CityNav performance in London (left panes) and NYU (right panes). Top: examples of the agent's trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualise the agent's value function over 100 trajectories with random starting points and the same goal. Thicker and warmer colour lines correspond to higher value functions.
  • Figure 5: Illustration of medium-sized held-out grid with gray corresponding to training destinations, black corresponding to held-out test destinations. Landmark locations are marked in red.
  • ...and 5 more figures
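
The landmark-based goal description in Figure 2(a) can be illustrated with a minimal sketch. It assumes the code $g_i$ is a softmax over negatively scaled distances from the goal to each landmark, so that nearer landmarks receive larger weights. The function names, the scaling factor `alpha`, the haversine distance, and the example coordinates are all assumptions made for illustration; none of them is taken from the caption or from the StreetLearn code.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    earth_radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def goal_code(goal, landmarks, alpha=0.002):
    """Encode a goal as softmax-normalised (negated, scaled) distances to landmarks.

    goal:      (lat, lon) of the target, in degrees.
    landmarks: list of (lat, lon) landmark positions, in degrees.
    alpha:     hypothetical scaling factor in 1/metres (an assumption of this sketch).
    """
    dists = [haversine_m(goal[0], goal[1], lat, lon) for lat, lon in landmarks]
    logits = [-alpha * d for d in dists]
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical landmark coordinates, roughly in Manhattan.
landmarks = [(40.7580, -73.9855), (40.7829, -73.9654), (40.7061, -74.0087)]
g = goal_code((40.7580, -73.9855), landmarks)
```

In this sketch the goal coincides with the first landmark, so the first entry of `g` dominates, and the entries always sum to one. The agent never sees the goal's coordinates directly, only this vector of relative proximities to fixed landmarks.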