Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data

Cherie Ho, Jiaye Zou, Omar Alama, Sai Mitheran Jagadesh Kumar, Benjamin Chiang, Taneesh Gupta, Chen Wang, Nikhil Keetha, Katia Sycara, Sebastian Scherer

TL;DR

The paper tackles the challenge of generalizable BEV map prediction from monocular FPV inputs by introducing Map It Anywhere (MIA), a data engine that automatically curates world-scale FPV-BEV pairs from Mapillary and OpenStreetMap. MIA assembles roughly 1.2 million FPV-BEV pairs spanning about 470 square kilometers, enabling a diverse training corpus that supports robust cross-domain evaluation. A simple, camera-intrinsics-agnostic model, Mapper, built on a gravity-aligned FPV encoder and a BEV decoder with a DINOv2 backbone, demonstrates strong zero-shot generalization and effective fine-tuning with limited data when trained on MIA. The findings suggest that leveraging large-scale public maps can empower anywhere map prediction, reducing reliance on expensive autonomous-vehicle data and providing a scalable benchmark for future cross-domain BEV perception research.

Abstract

Top-down Bird's Eye View (BEV) maps are a popular representation for ground robot navigation due to their richness and flexibility for downstream tasks. While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to small regions captured by current autonomous vehicle-based datasets. In this context, we show that a more scalable approach towards generalizable map prediction can be enabled by using two large-scale crowd-sourced mapping platforms, Mapillary for FPV images and OpenStreetMap for BEV semantic maps. We introduce Map It Anywhere (MIA), a data engine that enables seamless curation and modeling of labeled map prediction data from existing open-source map platforms. Using our MIA data engine, we display the ease of automatically collecting a dataset of 1.2 million pairs of FPV images & BEV maps encompassing diverse geographies, landscapes, environmental factors, camera models & capture scenarios. We further train a simple camera model-agnostic model on this data for BEV map prediction. Extensive evaluations using established benchmarks and our dataset show that the data curated by MIA enables effective pretraining for generalizable BEV map prediction, with zero-shot performance far exceeding baselines trained on existing datasets by 35%. Our analysis highlights the promise of using large-scale public maps for developing & testing generalizable BEV perception, paving the way for more robust autonomous navigation. Website: https://mapitanywhere.github.io/

Paper Structure

This paper contains 25 sections, 14 figures, and 11 tables.

Figures (14)

  • Figure 1: Our Map It Anywhere (MIA) data engine empowers generalizable Bird's Eye View (BEV) map prediction from First-Person View (FPV) images. Left: MIA enables seamless automatic curation of quality FPV & semantic BEV map data from crowd-sourced platforms, Mapillary & OpenStreetMap. Right: As a tool for both training & benchmarking, MIA enables research towards anywhere map prediction. A simple model (Mapper) trained on data from MIA generalizes better to both held-out cities (MIA-OOD) & existing benchmarks, while state-of-the-art baselines trained on conventional autonomous vehicle datasets struggle.
  • Figure 2: Overview of how the MIA data engine enables automatic curation of FPV & BEV data. Given names of cities as input from the left, the top row shows FPV processing, while the bottom row depicts BEV processing. Both pipelines converge on the right, producing FPV, BEV, and pose tuples.
  • Figure 3: Comparison of default MapMachine-style rendering with MIA-style rendering. Our rendering removes irrelevant information, clusters key semantic categories, aligns better with satellite imagery, and provides more accurate sidewalk geometry. Satellite imagery is not part of the MIA data engine; it was obtained from Google Earth Engine (Gorelick et al., 2017) only for tuning map rendering.
  • Figure 4: Samples from the MIA dataset: Highlighting diversity in time of day, seasons, weather and capture scenarios from vehicles & pedestrians.
  • Figure 5: Mapper consistently provides more precise & realistic zero-shot predictions across all the datasets. Notably, Mapper, empowered by MIA data, produces zero-shot predictions comparable to those of fully supervised baselines trained on in-domain data.
  • ...and 9 more figures