CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving

Zhijian Qiao, Zehuan Yu, Tong Li, Chih-Chung Chou, Wenchao Ding, Shaojie Shen

TL;DR

CSMapping tackles the challenge of building scalable, high-quality semantic and topological maps for autonomous driving from crowdsourced data. It combines a learned HD-map latent diffusion prior with a vectorized initialization and latent-space MAP optimization, using a Gaussian-basis reparameterization and multi-start posterior scoring to robustly complete partial observations. For topology, it introduces confidence-weighted k-medoids clustering with kinematic refinement to produce drivable centerlines that improve as data grows. Experiments on nuScenes, Argoverse 2, and a proprietary dataset report state-of-the-art performance, supported by ablations and scalability analyses across training and inference, as well as practical benefits for online perception. The resulting framework builds maps that progressively improve with data, and it also supports online detection enhancement and cross-submap consistency via factor-graph optimization.

Abstract

Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and a latent-space factor graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
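
To make the "confidence-weighted k-medoids" step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each trajectory has already been reduced to a pairwise distance matrix and a scalar confidence, and that confidence enters as a weight in the medoid-update cost. The function name `weighted_k_medoids`, the toy lateral offsets, and the weighting scheme are all illustrative assumptions; the kinematic refinement stage is not shown.

```python
import random


def assign(dist, medoids):
    """Label each item with the index (into `medoids`) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def weighted_k_medoids(dist, weights, k, n_iter=50, seed=0):
    """Alternating k-medoids where each member's pull is scaled by its weight.

    dist    : n x n symmetric distance matrix (list of lists)
    weights : n non-negative confidences (assumed weighting scheme)
    """
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(n_iter):
        labels = assign(dist, medoids)
        new_medoids = list(medoids)
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                continue  # keep the previous medoid for an empty cluster
            # Pick the member minimizing the confidence-weighted distance sum.
            new_medoids[c] = min(
                members,
                key=lambda m: sum(weights[i] * dist[i][m] for i in members),
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six "trajectories" summarized by a lateral offset (hypothetical units).
offsets = [0, 1, 2, 50, 51, 52]
confidence = [1, 1, 5, 1, 1, 5]  # items 2 and 5 are trusted most
dist = [[abs(p - q) for q in offsets] for p in offsets]

medoids, labels = weighted_k_medoids(dist, confidence, k=2)
print(sorted(offsets[m] for m in medoids))  # prints [2, 52]
```

In this toy case the high-confidence items become the medoids of their groups, whereas an unweighted k-medoids would pick the middle item of each group. Choosing a medoid rather than a mean keeps the representative an actually driven trajectory, which is one plausible reason to prefer it for centerlines.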

Paper Structure

This paper contains 74 sections, 1 theorem, 53 equations, 30 figures, 5 tables.

Key Result

Lemma 1

Consider a minimum-cost warping path $\pi$ connecting two points $s$ and $t$ in the parameter space. For any segment $\sigma \subseteq \pi$ bounded by intermediate points $s'$ and $t'$, the segment $\sigma$ necessarily achieves the minimum cost among all valid paths from $s'$ to $t'$.
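
Lemma 1 is the optimal-substructure property that makes dynamic programming applicable to warping-path problems. The sketch below checks it numerically on a discrete analogue (classic DTW on a grid), not on the continuous formulation the paper uses. The function names, the absolute-difference cost, and the sample sequences are assumptions made for illustration.

```python
import math


def cost_table(a, b, start, end):
    """Cheapest monotone path cost from cell `start` to every cell up to `end`.

    Cells are (i, j) index pairs; allowed steps are right, down, and diagonal.
    The cost of a path is the sum of |a[i] - b[j]| over the cells it visits.
    """
    (i0, j0), (i1, j1) = start, end
    table = {}
    for i in range(i0, i1 + 1):
        for j in range(j0, j1 + 1):
            local = abs(a[i] - b[j])
            if (i, j) == start:
                table[i, j] = local
                continue
            best_prev = min(
                table.get((i - 1, j), math.inf),
                table.get((i, j - 1), math.inf),
                table.get((i - 1, j - 1), math.inf),
            )
            table[i, j] = local + best_prev
    return table


def optimal_path(table, start, end):
    """Backtrack from `end` to `start` along cheapest predecessors."""
    path = [end]
    i, j = end
    while (i, j) != start:
        preds = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in preds if p in table), key=table.__getitem__)
        path.append((i, j))
    return path[::-1]


def check_optimal_substructure(a, b):
    """True if every segment of the optimal path is itself a cheapest path."""
    start, end = (0, 0), (len(a) - 1, len(b) - 1)
    path = optimal_path(cost_table(a, b, start, end), start, end)
    for p in range(len(path)):
        for q in range(p, len(path)):
            segment = sum(abs(a[i] - b[j]) for i, j in path[p:q + 1])
            best = cost_table(a, b, path[p], path[q])[path[q]]
            if segment != best:
                return False
    return True


print(check_optimal_substructure([0, 2, 4, 3, 1], [0, 1, 4, 4, 2, 1]))
```

Integer sequences are used so that the equality comparison between segment cost and recomputed minimum is exact. If any segment were not cheapest, swapping in the cheaper segment would lower the total cost, contradicting the optimality of the full path.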

Figures (30)

  • Figure 1: CSMapping leverages crowdsourced vectorized observations collected at scale, namely road semantic detections and vehicle trajectories, to produce semantic maps and topological centerlines, respectively. Despite noisy observations and incomplete coverage, the system robustly reconstructs observed regions and generates plausible content for unobserved regions (orange dashed regions). A top-down RGB basemap visualizes the coverage of crowdsourced observations.
  • Figure 2: Comparison of mapping paradigms. (a) Classical mapping employs gradient descent (GD) optimization with hand-crafted likelihoods for accurate results, though it relies on weak or absent priors. (b) Feed-forward mapping learns the posterior directly, enabling fast inference but demanding paired data supervision. (c) Our approach learns a prior on HD maps without requiring paired supervision and conducts scene-specific MAP estimation to construct accurate and complete maps.
  • Figure 3: Data association for vectorized observations. (a) Two local semantic maps from different sessions (cyan and orange). (b) Limited overlap between observations from different sessions. (c) Instance segmentation with different colors; rejected regions (red and violet dashed circles, corresponding to issues 1 and 2 in the data association section, respectively) indicate portions of vectorized observations to be excluded.
  • Figure 4: (a) Continuous curve matching via CDTW, where the C$\rightarrow$A correspondence is obtained through sequential C$\rightarrow$B and B$\rightarrow$A propagation. Non-overlapping portions are marked as invalid matches (gray dashed lines). (b) Cost fields for B$\rightarrow$A and C$\rightarrow$B matching, with black lines denoting optimal warping paths. The cost is defined as the Euclidean distance between corresponding points on the two curves.
  • Figure 5: Illustration of latent diffusion and inversion. (a) Latent diffusion and denoising process with decoded maps along the trajectory. (b) From left to right: initial map from vectorized mapping, inverted latent from diffusion inversion and its generated map, and a random Gaussian latent with its generated map.
  • ...and 25 more figures

Theorems & Definitions (2)

  • Lemma 1
  • proof