
Unified Map Prior Encoder for Mapping and Planning

Zongzheng Zhang, Sizhe Zou, Guantian Zheng, Zhenxin Zhu, Yu Gao, Guoxuan Chi, Shuo Wang, Yuwen Heng, Zhigang Sun, Yiru Wang, Hao Sun, Chao Ma, Zhen Li, Anqing Jiang, Hao Zhao

Abstract

Online mapping and end-to-end (E2E) planning in autonomous driving remain largely sensor-centric, leaving rich map priors, including HD/SD vector maps, rasterized SD maps, and satellite imagery, underused because of heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches. The vector encoder pre-aligns HD/SD polylines with a frame-wise SE(2) correction, encodes points via multi-frequency sinusoidal features, and produces polyline tokens with confidence scores. BEV queries then apply cross-attention with confidence bias, followed by normalized channel-wise gating to avoid length imbalance and softly down-weight uncertain sources. The raster encoder shares a ResNet-18 backbone conditioned by FiLM with scaling and shift at every stage, performs SE(2) micro-alignment, and injects priors through zero-initialized residual fusion, so the network starts from a do-no-harm baseline and learns to add only useful prior evidence. A vector-then-raster fusion order reflects the inductive bias of geometry first, appearance second. On nuScenes mapping, UMPE lifts MapTRv2 from 61.5 to 67.4 mAP (+5.9) and MapQR from 66.4 to 71.7 mAP (+5.3). On Argoverse2, UMPE adds +4.1 mAP over strong baselines. UMPE is compositional: when trained with all priors, it outperforms single-prior models even when only one prior is available at test time, demonstrating powerset robustness. For E2E planning with the VAD backbone on nuScenes, UMPE reduces the average L2 trajectory error from 0.72 m to 0.42 m (-0.30 m) and the collision rate from 0.22% to 0.12% (-0.10%), surpassing recent prior-injection methods. These results show that a unified, alignment-aware treatment of heterogeneous map priors yields better mapping and better planning.
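As a concrete illustration of the vector branch described above, the following NumPy sketch applies a frame-wise SE(2) correction to polyline points and then encodes them with multi-frequency sinusoidal features. The function names, frequency schedule, and feature layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def se2_align(points, theta, t):
    """Apply an SE(2) correction (rotation theta, translation t)
    to an (N, 2) array of polyline points."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T + t

def sinusoidal_encode(points, num_freqs=4):
    """Encode (N, 2) points with multi-frequency sinusoidal features:
    sin and cos per coordinate per frequency, giving (N, 4 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)             # (F,) geometric frequencies
    scaled = points[:, :, None] * freqs             # (N, 2, F)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(points.shape[0], -1)       # (N, 4F)

# Example: a short lane polyline, pre-aligned to compensate for pose drift.
poly = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
aligned = se2_align(poly, theta=0.05, t=np.array([0.1, -0.2]))
tokens = sinusoidal_encode(aligned, num_freqs=4)
print(tokens.shape)  # (3, 16)
```

These per-point features would then be aggregated into polyline tokens before the confidence-biased cross-attention step.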

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Unified Map Prior Encoder (UMPE). UMPE ingests an arbitrary subset of four map priors, vector (HD/SD vectorized maps) and raster (rasterized SD map, satellite imagery), and processes them via a vector encoder and a raster encoder. The resulting priors are fused with BEV features, supporting both online HD mapping and end-to-end planning tasks.
  • Figure 2: Unified Map Prior Encoder (UMPE) architecture. (a) Vector Encoder: HD/SD polylines are $\mathrm{SE}(2)$ pre-aligned and encoded; BEV queries attend to each source with confidence-biased dual cross-attention. Presence-normalized, channel-wise gating mixes sources to produce fused vector tokens $\bar{\mathbf{Y}}$. (b) Raster Encoder: rasterized SD map and satellite imagery pass through a shared FiLM-conditioned ResNet, then undergo $\mathrm{SE}(2)$ micro-alignment in raster space; channel-wise gating yields fused raster tokens $\bar{\mathbf{Z}}$. (c) Residual fusion: $\bar{\mathbf{Y}}$ and $\bar{\mathbf{Z}}$ are injected with a learned scalar $\alpha$, producing $\mathbf{X}_{\mathrm{UMPE}}$.
  • Figure 3: Online mapping visualization on nuScenes. Adding UMPE to both MapTRv2 and MapQR produces more accurate maps, especially in the green-highlighted regions: baselines show broken pedestrian crossings, kinked boundaries, and missing dividers; UMPE straightens and restores them.
  • Figure 4: End-to-end planning visualization on nuScenes. The ego vehicle is turning left. VAD without priors drifts toward the oncoming lane; adding the vector encoder or raster encoder improves lane adherence but leaves lateral error, while VAD+UMPE produces a trajectory that tightly overlaps the GT.
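The do-no-harm residual fusion described in the abstract and in Figure 2(c) can be sketched as follows: prior tokens are mixed by presence-normalized gates and added to the BEV features through a scalar $\alpha$ initialized to zero, so at initialization the model reduces exactly to the prior-free baseline. The shapes, gate form, and function names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fuse_priors(x_bev, sources, gates, alpha):
    """Zero-initialized residual fusion of prior tokens into BEV features.

    x_bev:   (T, C) BEV features
    sources: list of (T, C) prior token maps (any subset may be present)
    gates:   per-source scalar gate logits, same length as sources
    alpha:   learned scalar, initialized to 0 (the "do-no-harm" start)
    """
    if not sources:
        return x_bev  # no priors available: fall back to the sensor-only baseline
    # Presence-normalized gating: softmax over only the sources that are
    # present, so the mixture is insensitive to how many priors are missing.
    g = np.asarray(gates, dtype=float)
    w = np.exp(g) / np.sum(np.exp(g))
    mixed = sum(wi * s for wi, s in zip(w, sources))
    return x_bev + alpha * mixed

x = np.random.randn(6, 8)
priors = [np.random.randn(6, 8), np.random.randn(6, 8)]
out = fuse_priors(x, priors, gates=[0.3, -0.1], alpha=0.0)
print(np.allclose(out, x))  # True: with alpha = 0 the fusion is a no-op
```

Because $\alpha$ starts at zero, gradients can only grow the prior contribution when it actually reduces the task loss, which is the mechanism behind the powerset robustness the paper reports.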