Maps from Motion (MfM): Generating 2D Semantic Maps from Sparse Multi-view Images

Matteo Toso, Stefano Fiorini, Stuart James, Alessio Del Bue

TL;DR

Maps from Motion (MfM) is a step toward automating the time-consuming process of manual map making: it computes 2D maps of semantic objects directly from a collection of uncalibrated multi-view images. Extensive evaluation on synthetic and real-world data shows that the method obtains a solution even in scenarios where standard optimization techniques fail.

Abstract

World-wide detailed 2D maps require enormous collective efforts. OpenStreetMap is the result of 11 million registered users manually annotating the GPS location of over 1.75 billion entries, including distinctive landmarks and common urban objects. At the same time, manual annotations can include errors and are slow to update, limiting the map's accuracy. Maps from Motion (MfM) is a step toward automating this time-consuming map-making procedure by computing 2D maps of semantic objects directly from a collection of uncalibrated multi-view images. From each image, we extract a set of object detections and estimate their spatial arrangement in a top-down local map centered in the reference frame of the camera that captured the image. Aligning these local maps is not a trivial problem, since they provide incomplete, noisy fragments of the scene, and matching detections across them is unreliable because of the presence of repeated patterns and the limited appearance variability of urban objects. We address this with a novel graph-based framework that encodes the spatial and semantic distribution of the objects detected in each image, and learns how to combine them to predict the objects' poses in a global reference system, while taking into account all possible detection matches and preserving the topology observed in each image. Despite the complexity of the problem, our best model achieves global 2D registration with an average accuracy within 4 meters (i.e., below GPS accuracy) even on sparse sequences with strong viewpoint change, on which COLMAP has an 80% failure rate. We provide extensive evaluation on synthetic and real-world data, showing how the method obtains a solution even in scenarios where standard optimization techniques fail.

Paper Structure

This paper contains 22 sections, 11 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The Maps from Motion (MfM) Concept. From a set of uncalibrated images (a) we extract 2D detections of static urban objects (b), and generate local top-down 2D maps representing the spatial arrangement of the objects with respect to each image (c). We then learn how to register all local maps in the same reference frame (d), to generate a common global map with all objects present in the scene (e).
  • Figure 2: The MfM Pipeline. a) We extract 2D maps representing the spatial arrangement of detected objects, from the image's point of view. b) The maps are encoded as a graph, with a node for each detection, and edges connecting detections from the same image (Same-Map) or with the same class label (Same-class). c) A GNN predicts the location of all objects and the cameras in one reference frame.
  • Figure 3: Graph Formulation for the MfM problem. Given a set of local maps, we a) assign to each map annotation a node in the graph and b) draw intra-map edges to generate a complete subgraph for each map. We then c) draw inter-map edges connecting detections matched to the same object. Finally, we d) assign to each node of the graph an embedding defined by concatenating the detection's coordinates in the corresponding local map, the one-hot encoding of its semantic class, and the location of a bounding box fitted to the segmentation mask.
  • Figure 4: Average Euclidean error on the reconstructed object locations ($\mu_o$).
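
The graph construction described in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_mfm_graph`, the dictionary keys `xy`, `cls`, and `bbox`, the number of classes, and the choice to link every cross-map pair sharing a class label as a candidate match are all assumptions made for illustration. The sketch covers only node embeddings and edge lists, not the GNN that predicts global poses.

```python
from itertools import combinations

import numpy as np

NUM_CLASSES = 3  # hypothetical number of semantic classes


def build_mfm_graph(local_maps, num_classes=NUM_CLASSES):
    """Build node embeddings and edge lists from a list of local maps.

    Each local map is a list of detections. Each detection is a dict with
    (hypothetical) keys: 'xy' (2D position in the local map), 'cls'
    (integer class id), and 'bbox' (x_min, y_min, x_max, y_max).
    """
    features, map_ids, classes = [], [], []
    for map_id, detections in enumerate(local_maps):
        for det in detections:
            one_hot = np.zeros(num_classes)
            one_hot[det["cls"]] = 1.0
            # Embedding: local-map coordinates, one-hot class, bounding box.
            features.append(np.concatenate([
                np.asarray(det["xy"], dtype=float),
                one_hot,
                np.asarray(det["bbox"], dtype=float),
            ]))
            map_ids.append(map_id)
            classes.append(det["cls"])

    same_map, same_class = [], []
    for i, j in combinations(range(len(features)), 2):
        if map_ids[i] == map_ids[j]:
            # Intra-map edge: each local map becomes a complete subgraph.
            same_map.append((i, j))
        elif classes[i] == classes[j]:
            # Inter-map edge: assumed candidate match (same class label).
            same_class.append((i, j))
    return np.stack(features), same_map, same_class


# Two toy local maps with two detections each (made-up values).
local_maps = [
    [{"xy": (1.0, 2.0), "cls": 0, "bbox": (10, 20, 30, 40)},
     {"xy": (3.0, 1.0), "cls": 1, "bbox": (50, 20, 70, 60)}],
    [{"xy": (-1.0, 2.5), "cls": 0, "bbox": (15, 25, 35, 45)},
     {"xy": (0.5, 4.0), "cls": 2, "bbox": (80, 10, 95, 30)}],
]
x, same_map, same_class = build_mfm_graph(local_maps)
print(x.shape, same_map, same_class)
```

In this toy input, the two class-0 detections come from different maps, so they are joined by a single inter-map edge, while each map contributes one intra-map edge. Keeping the two edge types in separate lists lets a downstream model treat them differently.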