UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation

Rohit Mohan, Florian Drews, Yakov Miron, Daniele Cattaneo, Abhinav Valada

TL;DR

UP-Fuse addresses the reliability gap in LiDAR–camera fusion for 3D panoptic segmentation by learning an uncertainty-aware fusion that attenuates unreliable visual cues under degradation. The framework projects both modalities into a unified range-view, fuses them with a deformable attention mechanism guided by predicted aleatoric uncertainty, and decodes directly into 3D panoptic masks with a hybrid 2D–3D transformer decoder. Key contributions include the uncertainty-guided fusion module, the 2D–3D panoptic decoder, and a new Panoptic Waymo benchmark derived from the Waymo Open Dataset; extensive experiments demonstrate strong accuracy and robustness under camera dropout, calibration drift, and domain shifts. The proposed approach achieves competitive PQ across Panoptic nuScenes, SemanticKITTI, and Panoptic Waymo while maintaining high efficiency, making it suitable for safety-critical robotic perception. Overall, UP-Fuse provides a practical, reliability-aware path toward robust multi-modal 3D perception in diverse operating conditions.
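
To make the range-view projection mentioned above concrete, here is a minimal sketch of a standard spherical projection of LiDAR points onto a 2D range image. It is not taken from the paper: the function name `spherical_projection`, the image size, and the vertical field-of-view values are illustrative assumptions, and the authors' exact projection may differ.

```python
import numpy as np

def spherical_projection(points, height=32, width=1024,
                         fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project an (N, 3) point cloud onto a (height, width) range image.

    The sensor parameters are hypothetical defaults, not values from the paper.
    Returns the range image (-1 marks empty pixels) and per-point (row, col).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points, axis=1)

    yaw = np.arctan2(y, x)                                      # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(rng, 1e-6), -1.0, 1.0))

    fov_up = np.deg2rad(fov_up_deg)
    fov = fov_up - np.deg2rad(fov_down_deg)                     # total vertical FOV

    u = 0.5 * (1.0 - yaw / np.pi) * width                       # column coordinate
    v = (fov_up - pitch) / fov * height                         # row coordinate
    cols = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    range_image = np.full((height, width), -1.0, dtype=np.float32)
    # Write far points first so that the nearest point per pixel is kept.
    order = np.argsort(-rng)
    range_image[rows[order], cols[order]] = rng[order]
    return range_image, rows, cols
```

The returned `rows` and `cols` give each 3D point its pixel in the range image, which is the kind of mapping that would let per-pixel features or predictions be carried back to 3D points.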

Abstract

LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.
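
The idea of letting an uncertainty map modulate how much camera information enters the fused representation can be illustrated with a deliberately simplified sketch. It replaces the paper's deformable cross-modal attention with a plain per-pixel gate, so it should be read as an illustration of the principle only. The function name `uncertainty_gated_fusion`, the `exp(-u)` confidence mapping, and the additive fusion are all assumptions, not the authors' formulation.

```python
import numpy as np

def uncertainty_gated_fusion(lidar_feat, cam_feat, uncertainty, cam_valid):
    """Fuse range-view features with a per-pixel confidence gate (illustrative).

    lidar_feat, cam_feat: (C, H, W) feature maps in the shared range-view.
    uncertainty:          (H, W) non-negative predicted uncertainty.
    cam_valid:            (H, W) boolean mask, False where no camera pixel
                          maps to the range-view location.
    """
    # Assumed mapping: high uncertainty -> confidence near 0.
    confidence = np.exp(-uncertainty) * cam_valid.astype(lidar_feat.dtype)
    # LiDAR features always pass through; camera features are attenuated.
    return lidar_feat + confidence[None, :, :] * cam_feat
```

With this gate, pixels with high predicted uncertainty or no camera coverage fall back to the LiDAR features alone, which mirrors the behavior the abstract describes under camera degradation or dropout.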

Paper Structure (27 sections, 14 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Visualization of 3D panoptic segmentation: green indicates correct and red denotes errors. LiDAR-camera fusion significantly enhances segmentation over LiDAR-only methods, accurately detecting previously missed vehicles. However, camera sensor failure scenarios reveal a critical vulnerability, with fusion-based performance falling below LiDAR-only baselines and failing to detect previously identified objects. This highlights the crucial need for both relevance and reliability in multi-modal perception.
  • Figure 2: Illustration of the proposed UP-Fuse architecture. LiDAR and multi-view camera images are fused onto a shared space of range-view feature representations. The Uncertainty-Aware Fusion Module adaptively integrates modalities via uncertainty-weighted deformable cross-modal interaction to attenuate unreliable visual cues. Finally, a Hybrid 2D-3D Panoptic Decoder generates 3D predictions. Paths and blocks shown in brown are used only during training.
  • Figure 3: Illustration of our uncertainty module on a Panoptic nuScenes sample. The right column shows predicted uncertainty under increasing synthetic degradations. The jet colormap marks low uncertainty in blue and high uncertainty in red. Mild distortions (b) keep uncertainty low, while strong distortions (c) and sensor dropout (d) produce high uncertainty. Black regions indicate areas without a camera to range-view (RV) mapping.
  • Figure 4: Robust performance comparison on the Panoptic nuScenes validation set under increasing calibration drift between LiDAR and camera over rotation magnitudes from $0^{\circ}$ to $5^{\circ}$. UP-Fuse (red) outperforms all baselines, dropping only $4.4\%$ in PQ compared to $>8\%$ for state-of-the-art methods. Refer to the supplementary material for visualization of the projection shifts.
  • Figure 5: Ablation studies on the key hyperparameters of our Hybrid 2D-3D Panoptic Decoder.
  • ...and 7 more figures