Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them

Michael Hubbertz, Qi Han, Tobias Meisen

Abstract

Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: memorization of input features and overfitting to known map geometries. Our measures are based on evaluation subsets that control for geographical proximity and geometric similarity between training and validation scenes. We introduce Fréchet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define two complementary failure-mode scores: a localization overfitting score quantifying the performance drop when geographic cues disappear, and a map geometry overfitting score measuring degradation as scenes become geometrically novel. Beyond models, we analyze dataset biases and contribute map geometry-aware diagnostics: a minimum-spanning-tree (MST) diversity measure for training sets and a symmetric coverage measure that quantifies geometric similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balance and performance while shrinking the training set. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield a more trustworthy assessment of generalization and show that map geometry-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and map geometry-centric dataset design for deployable online mapping.

Paper Structure

This paper contains 32 sections, 28 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Validation set performance of different state-of-the-art online mapping models on the nuScenes dataset [caesar_nuscenes_2020] with the original split and two different geographically disjoint splits [lilja_localization_2024, yuan_streammapnet_2024].
  • Figure 2: Correlation between $d(v)$ and $s(v)$ for the nuScenes original split (Pearson correlation coefficient $r = 0.724$). Three representative pairs illustrate $s(v)$, each showing a validation sample alongside its closest geometric match from the training set.
  • Figure 3: Visualization of two exemplary prediction and ground truth map element pairs. The Chamfer distance (red) and Fréchet distance (green) are shown as performance metrics for both cases. In example (a), both metrics produce meaningful results. In example (b), the Chamfer distance remains nearly unchanged compared to (a) because it ignores point ordering, whereas the Fréchet distance yields a much higher value, capturing the larger geometric deviation.
  • Figure 4: Sample-wise performance of MapTRv2 measured by $M$ for the nuScenes original split, plotted over $d(v)$ (left) and $s(v)$ (right). In the left plot, an exemplary threshold $T_\text{dist} = 5\,\text{m}$ is displayed, separating the validation set $V$ into $V_\text{close}$ and $V_\text{far}$. Performance is positively correlated with both $d(v)$ and $s(v)$, and the correlation is stronger for $s(v)$ (Pearson correlation coefficient $r = 0.568$) than for $d(v)$ ($r = 0.379$).
  • Figure 5: Effect of MST-based training set sparsification on sample size and diversity (top) and on MapTRv2 performance on the validation set (bottom). Results are shown for all examined nuScenes and Argoverse 2 splits; for comparison, the training sets of the original splits are also randomly subsampled.
  • ...and 5 more figures
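
The contrast described in the Figure 3 caption, where the Chamfer distance ignores point ordering while the Fréchet distance does not, can be illustrated with a small sketch. This is a toy example under stated assumptions, not the paper's implementation: it uses the discrete Fréchet distance computed by dynamic programming and a symmetric mean nearest-neighbor Chamfer distance, and the function names and polylines below are hypothetical. The exact variants used by the authors are not specified in this excerpt.

```python
import numpy as np


def pairwise_dist(p, q):
    """Euclidean distances between all points of polylines p (n, 2) and q (m, 2)."""
    return np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)


def chamfer_distance(p, q):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions.

    Treats the polylines as unordered point sets (assumed variant).
    """
    d = pairwise_dist(p, q)
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))


def discrete_frechet_distance(p, q):
    """Discrete Fréchet distance via dynamic programming (assumed variant).

    ca[i, j] is the smallest achievable maximum pair distance over all
    order-preserving couplings of p[:i + 1] and q[:j + 1].
    """
    d = pairwise_dist(p, q)
    n, m = d.shape
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            best_prev = min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1])
            ca[i, j] = max(best_prev, d[i, j])
    return float(ca[-1, -1])


# Hypothetical ground-truth polyline: five points along the x-axis.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])

# Case (a): prediction with a small lateral offset; both metrics stay small.
pred_a = gt + np.array([0.0, 0.1])

# Case (b): prediction visiting the same points in a zigzag order.
pred_b = gt[[0, 3, 1, 4, 2]]

chamfer_a = chamfer_distance(pred_a, gt)
frechet_a = discrete_frechet_distance(pred_a, gt)
chamfer_b = chamfer_distance(pred_b, gt)
frechet_b = discrete_frechet_distance(pred_b, gt)
```

In this toy setup, case (a) gives roughly 0.1 for both metrics. In case (b) the Chamfer distance is 0 because the point sets coincide, whereas the discrete Fréchet distance is 2.0 because the ordering of the points differs.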