The StreetLearn Environment and Dataset

Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Malinowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell

TL;DR

StreetLearn is an interactive navigation environment built on real Google Street View imagery, enabling end-to-end, goal-driven visual navigation in city-scale graphs. It defines a courier-style task with absolute goal coordinates, introduces a curriculum of gradually harder goals, and provides an open-source codebase and dataset. Baseline agents (CityNav and MultiCityNav) trained with IMPALA perform strongly in New York City and find Pittsburgh more challenging; experiments on generalization to held-out regions and cross-city transfer show both the potential and the limitations of the approach. The work offers a benchmark for grounded, long-range navigation in diverse, photorealistic urban settings and a testbed for studying how perception, planning and memory interact under real-world connectivity constraints.

Abstract

Navigation is a rich and well-grounded problem domain that drives progress in many different areas of research: perception, planning, memory, exploration, and optimisation in particular. Historically these challenges have been separately considered and solutions built that rely on stationary datasets - for example, recorded trajectories through an environment. These datasets cannot be used for decision-making and reinforcement learning, however, and in general the perspective of navigation as an interactive learning task, where the actions and behaviours of a learning agent are learned simultaneously with the perception and planning, is relatively unsupported. Thus, existing navigation benchmarks generally rely on static datasets (Geiger et al., 2013; Kendall et al., 2015) or simulators (Beattie et al., 2016; Shah et al., 2018). To support and validate research in end-to-end navigation, we present StreetLearn: an interactive, first-person, partially-observed visual environment that uses Google Street View for its photographic content and broad coverage, and give performance baselines for a challenging goal-driven navigation task. The environment code, baseline agent code, and the dataset are available at http://streetlearn.cc

Paper Structure

This paper contains 19 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our environment is built of real-world places from StreetView. The figure shows diverse views and corresponding local maps in New York City (Times Square, Central Park) and London (St. Paul's Cathedral). The green cone represents the agent's location and orientation.
  • Figure 2: Maps with bounding boxes indicating the dataset coverage in New York City (top) and Pittsburgh (bottom).
  • Figure 3: Maps with polygons delimiting the Wall Street (1), Union Square (2) and Hudson (3) regions in New York City (top) and the CMU (4), Allegheny (5) and South Shore (6) regions in Pittsburgh (bottom).
  • Figure 4: Main loop for interacting with the environment.
  • Figure 5: Comparison of architectures. Left: CityNav is a single-city navigation architecture with a policy LSTM, a separate goal LSTM, and optional auxiliary heading ($\theta$). Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.
  • ...and 1 more figure
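
The interaction loop named in Figure 4 and the courier-style task described in the TL;DR can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ToyCourierEnv`, `greedy_policy`, `run_episode`, the grid layout, the reward of 1.0, and the observation format are hypothetical and are not the StreetLearn API. In particular, the toy exposes the agent's position directly, whereas a StreetLearn agent observes first-person images.

```python
import random


class ToyCourierEnv:
    """Hypothetical grid stand-in for a city graph (not the StreetLearn API)."""

    # Four grid moves as (dx, dy); the real action space is an assumption here.
    ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def __init__(self, size=5, max_steps=200, seed=0):
        self.size = size
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.pos = (0, 0)
        self.goal = None
        self.steps = 0

    def _sample_goal(self):
        # Draw goal coordinates that differ from the current position.
        while True:
            goal = (self.rng.randrange(self.size), self.rng.randrange(self.size))
            if goal != self.pos:
                return goal

    def _observe(self):
        # The goal is given as absolute coordinates, as in the courier task.
        return {"position": self.pos, "goal": self.goal}

    def reset(self):
        self.pos = (0, 0)
        self.steps = 0
        self.goal = self._sample_goal()
        return self._observe()

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        # Clamp to the grid so every move stays on a valid node.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.steps += 1
        reward = 0.0
        if self.pos == self.goal:
            reward = 1.0  # assumed reward value, for illustration only
            self.goal = self._sample_goal()  # courier-style: next delivery
        done = self.steps >= self.max_steps
        return self._observe(), reward, done


def greedy_policy(obs):
    """Hand-written baseline: close the x gap first, then the y gap."""
    (x, y), (gx, gy) = obs["position"], obs["goal"]
    if x < gx:
        return 0
    if x > gx:
        return 1
    if y < gy:
        return 2
    return 3


def run_episode(env, policy):
    """Main loop: observe, act, receive reward, repeat until the episode ends."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    env = ToyCourierEnv(size=5, max_steps=50, seed=0)
    print("goals reached:", run_episode(env, greedy_policy))
```

In a learned setting, `greedy_policy` would be replaced by a trained agent; the loop in `run_episode` stays the same.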