SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction

Haoxiang Fu, Lingfeng Zhang, Hao Li, Ruibing Hu, Zhengrong Li, Guanjing Liu, Zimu Tan, Long Chen, Hangjun Ye, Xiaoshuai Hao

TL;DR

This work proposes SEF-MAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. It introduces an uncertainty-aware gating mechanism at the BEV-cell level that down-weights unreliable experts based on their predictive variance, complemented by a usage-balance regularizer that prevents expert collapse.

Abstract

High-definition (HD) maps are essential for autonomous driving, yet multimodal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEF-MAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage-balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated with surrogate features built from exponential-moving-average (EMA) statistics, and a specialization loss enforces distinct behaviors of the private, shared, and interaction experts across complete and masked inputs. Experiments on the nuScenes and Argoverse2 benchmarks demonstrate that SEF-MAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAP provides a robust and effective solution for multimodal HD map prediction under diverse and degraded conditions.
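
To make the gating idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each expert emits a per-cell logit and a per-cell log-variance, that the gate is a softmax over `logit - log_var`, and that the usage-balance term is the KL divergence between mean expert usage and a uniform distribution. The function names, the `temperature` parameter, and these exact formulas are illustrative assumptions; the abstract only states that high-variance experts are down-weighted per BEV cell and that a regularizer discourages collapse.

```python
import numpy as np


def uncertainty_gate(logits, log_var, temperature=1.0):
    """Per-cell expert weights (assumed form: softmax of logit minus log-variance).

    logits, log_var: arrays of shape (E, H, W) for E experts on an H x W BEV grid.
    """
    scores = (logits - log_var) / temperature
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=0, keepdims=True)


def usage_balance_penalty(weights):
    """KL(mean usage || uniform); zero when all experts are used equally (assumed form)."""
    usage = weights.mean(axis=(1, 2))  # average weight per expert, shape (E,)
    num_experts = usage.shape[0]
    return float(np.sum(usage * np.log(usage * num_experts + 1e-12)))


def fuse(expert_outputs, weights):
    """Weighted sum of expert outputs (E, C, H, W) with per-cell weights (E, H, W)."""
    return (weights[:, None] * expert_outputs).sum(axis=0)


# Toy example: four experts with equal logits; expert 1 reports higher variance.
rng = np.random.default_rng(0)
E, C, H, W = 4, 8, 5, 5
logits = np.zeros((E, H, W))
log_var = np.zeros((E, H, W))
log_var[1] = 2.0  # expert 1 is less certain in every cell

weights = uncertainty_gate(logits, log_var)
fused = fuse(rng.standard_normal((E, C, H, W)), weights)
penalty = usage_balance_penalty(weights)
```

In this toy setup the uncertain expert receives a smaller weight in every cell, and the penalty is positive because usage is no longer uniform. A real system would learn the logits and variances and add the penalty to the training loss with some coefficient.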

Paper Structure (16 sections, 18 equations, 2 figures, 4 tables, 1 algorithm)

This paper contains 16 sections, 18 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overall architecture of the SEF-MAP framework. The system takes multi-view images and LiDAR point clouds as inputs and encodes each into BEV features with its own encoder. These features are then decomposed into four semantic subspaces (LiDAR-private, Image-private, Shared, and Interaction) via linear transformations. During training, distribution-aware masking creates surrogate features from EMA statistics to simulate modality degradation, and specialization losses enforce the expert roles. During inference, only the intact forward pass is performed, without masking. A mixture of experts processes each subspace, and an uncertainty-aware gating mechanism dynamically weights the expert outputs based on their predicted variance to generate the final HD map prediction.
  • Figure 2: Qualitative results on nuScenes. We present two sample scenes: (a) Ground Truth. (b) Baseline (MapTR). (c) SEF-MAP (SD and DAM). (d) SEF-MAP (full). The red boxes highlight challenging regions where the baseline MapTR produces noticeable errors and incomplete map predictions. The full SEF-MAP model with uncertainty-aware gating (d) achieves the most precise vectorized map reconstruction in these regions, demonstrating the effectiveness of our subspace decomposition and adaptive fusion strategy.
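
The decomposition and masking steps described in the Figure 1 caption can be sketched roughly as follows. This is a hedged illustration under several assumptions that the text above does not confirm: the EMA statistics are taken to be per-channel mean and variance, the surrogate is sampled from a Gaussian with those statistics, and the Shared and Interaction subspaces are both linear projections of the concatenated LiDAR and image features. The class `EMAStats`, the function `decompose`, and the `proj` dictionary keys are hypothetical names.

```python
import numpy as np


class EMAStats:
    """Running per-channel mean and variance of one modality's BEV features."""

    def __init__(self, channels, momentum=0.99):
        self.mean = np.zeros(channels)
        self.var = np.ones(channels)
        self.momentum = momentum

    def update(self, feat):
        # feat: (C, H, W); statistics are pooled over the spatial grid
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * feat.mean(axis=(1, 2))
        self.var = m * self.var + (1 - m) * feat.var(axis=(1, 2))

    def surrogate(self, shape, rng):
        # Assumed surrogate: Gaussian sample matching the running statistics
        noise = rng.standard_normal(shape)
        return self.mean[:, None, None] + np.sqrt(self.var)[:, None, None] * noise


def decompose(lidar, image, proj):
    """Project BEV features (C, H, W) into four subspaces with linear maps."""
    both = np.concatenate([lidar, image], axis=0)  # (2C, H, W)
    return {
        "lidar_private": np.einsum("dc,chw->dhw", proj["lidar"], lidar),
        "image_private": np.einsum("dc,chw->dhw", proj["image"], image),
        "shared": np.einsum("dc,chw->dhw", proj["shared"], both),
        "interaction": np.einsum("dc,chw->dhw", proj["interaction"], both),
    }


# Toy example: simulate a camera drop by swapping in the surrogate image feature.
rng = np.random.default_rng(0)
C, D, H, W = 8, 4, 5, 5
lidar = rng.standard_normal((C, H, W))
image = rng.standard_normal((C, H, W))
proj = {
    "lidar": rng.standard_normal((D, C)),
    "image": rng.standard_normal((D, C)),
    "shared": rng.standard_normal((D, 2 * C)),
    "interaction": rng.standard_normal((D, 2 * C)),
}

image_stats = EMAStats(C)
image_stats.update(image)  # would be called every training step
masked_image = image_stats.surrogate(image.shape, rng)
subspaces = decompose(lidar, masked_image, proj)
```

In this sketch the Shared and Interaction branches differ only in their weights; according to the abstract, what separates the expert roles is the specialization loss applied across complete and masked inputs, which is not modeled here.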