Neural Semantic Map-Learning for Autonomous Vehicles

Markus Herb, Nassir Navab, Federico Tombari

TL;DR

This work presents a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment including drivable area, lane markings, poles, obstacles and more as a 3D mesh.

Abstract

Autonomous vehicles demand detailed maps to maneuver reliably through traffic, and these maps need to be kept up to date to ensure safe operation. A promising way to adapt the maps to the ever-changing road network is to use crowd-sourced data from a fleet of vehicles. In this work, we present a mapping system that fuses local submaps gathered from a fleet of vehicles at a central instance to produce a coherent map of the road environment including drivable area, lane markings, poles, obstacles and more as a 3D mesh. Each vehicle contributes locally reconstructed submaps as lightweight meshes, making our method applicable to a wide range of reconstruction methods and sensor modalities. Our method jointly aligns and merges the noisy and incomplete local submaps using a scene-specific Neural Signed Distance Field, which is supervised using the submap meshes to predict a fused environment representation. We leverage memory-efficient sparse feature grids to scale to large areas and introduce a confidence score to model uncertainty in scene reconstruction. Our approach is evaluated on two datasets with different local mapping methods, showing improved pose alignment and reconstruction over existing methods. Additionally, we demonstrate the benefit of multi-session mapping and examine the amount of data required to enable high-fidelity map learning for autonomous vehicles.

Paper Structure

This paper contains 34 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Our neural multi-session mapping approach fuses dense submaps created from different agents through local dense SLAM by training a neural field to predict dense semantics and geometry that best explains the submaps collected from multiple mapping sessions.
  • Figure 2: Pipeline overview of our method: We collect meshes and poses $\mathcal{P}$ for each local agent mapping session, cluster them into different geographic tiles $t$, and use them to supervise the neural semantic signed distance field. The field consists of a feature grid $\Phi_t$ for each tile, a shared geometry head $\mathcal{H}_\text{Geo}$ predicting SDF $s(x) = \Omega_\text{sdf}(x)$ and confidence $c(x) = \Omega_\text{conf}(x)$, and a semantic head $\mathcal{H}_\text{Sem}$ predicting logits $l(x) = \Omega_\text{sem}(x)$, all conditioned on grid features $\mathcal{F}$ and point embeddings $\mathcal{E}$. We jointly optimize submap poses $\mathcal{P}_i$, neural field grids $\Phi_t$ and decoder heads $\mathcal{H}$. The final fused map reconstruction can be extracted from the neural field using Marching Cubes.
  • Figure 3: Qualitative reconstruction results for KITTI360 (top) and HD-Map (bottom).
  • Figure 4: Reconstruction results for varying numbers of sessions and submap sizes on the HD-Map dataset. (a) Increasing the number of sessions improves not just semantic F-Score, but also recall and precision. (b) Semantic F-Score for varying submap sizes, leading to different amounts of transmitted data per mapped area.
  • Figure 5: (a) Comparison of semantic F-Score for different HashGrid codebook sizes $N_C$ and OcTree feature sizes $N_F$ resulting in different map sizes (neural field weights).
  • ...and 2 more figures
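
The Figure 2 caption mentions clustering the collected submaps into geographic tiles $t$ before they supervise the per-tile feature grids. The paper's actual tiling scheme is not given on this page, so the following is only a minimal sketch under stated assumptions: tiles are axis-aligned squares on the ground plane, the tile size of 50 m is an arbitrary illustrative value, and the names `tile_index` and `cluster_into_tiles` are hypothetical helpers rather than anything from the authors' code.

```python
import math
from collections import defaultdict


def tile_index(point, tile_size):
    """Map a 3D point (x, y, z) to the integer (i, j) index of its ground-plane tile."""
    x, y, _z = point
    # math.floor keeps negative coordinates in the correct tile (e.g. -0.5 -> -1).
    return (math.floor(x / tile_size), math.floor(y / tile_size))


def cluster_into_tiles(submaps, tile_size=50.0):
    """Group submap vertices by tile.

    submaps: dict mapping a submap id to a list of (x, y, z) vertices,
             assumed to be expressed in a common world frame already.
    Returns a dict mapping tile index -> list of (submap_id, vertex).
    Only tiles that actually contain vertices are created (sparse storage).
    """
    tiles = defaultdict(list)
    for submap_id, vertices in submaps.items():
        for vertex in vertices:
            tiles[tile_index(vertex, tile_size)].append((submap_id, vertex))
    return dict(tiles)


# Toy input: two submaps from two hypothetical mapping sessions.
submaps = {
    "session_a/0": [(10.0, 5.0, 0.0), (49.9, 5.0, 0.1)],
    "session_b/0": [(51.0, 5.0, 0.0), (-0.5, 5.0, 0.0)],
}
tiles = cluster_into_tiles(submaps, tile_size=50.0)
for idx in sorted(tiles):
    print(idx, [submap_id for submap_id, _ in tiles[idx]])
```

Keying a dictionary by tile index means memory grows with the mapped area rather than with the bounding box of the whole road network, which is the same motivation the abstract gives for using sparse feature grids. A single submap may contribute vertices to several tiles, as `session_b/0` does here.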