UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation
Rohit Mohan, Florian Drews, Yakov Miron, Daniele Cattaneo, Abhinav Valada
TL;DR
UP-Fuse addresses the reliability gap in LiDAR–camera fusion for 3D panoptic segmentation by learning an uncertainty-aware fusion that attenuates unreliable visual cues under degradation. The framework projects both modalities into a shared range-view, fuses them with a deformable attention mechanism guided by predicted aleatoric uncertainty, and decodes directly into 3D panoptic masks with a hybrid 2D–3D transformer decoder. Key contributions include the uncertainty-guided fusion module, the 2D–3D panoptic decoder, and a new Panoptic Waymo benchmark derived from the Waymo Open Dataset; extensive experiments demonstrate strong accuracy and robustness under camera dropout, calibration drift, and domain shifts. The proposed approach achieves competitive panoptic quality (PQ) across Panoptic nuScenes, SemanticKITTI, and Panoptic Waymo while maintaining high efficiency, making it suitable for safety-critical robotic perception. Overall, UP-Fuse provides a practical, reliability-aware path toward robust multi-modal 3D perception in diverse operating conditions.
Abstract
LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.
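The range-view projection mentioned in the abstract can be illustrated with the standard spherical projection commonly used for rotating LiDAR sensors. The following is a minimal sketch, not the authors' implementation: the function name `range_view_projection`, the image size, and the vertical field-of-view values are all assumptions chosen for illustration, and the real pipeline would store more per-pixel channels than range alone.

```python
import math

def range_view_projection(points, height=32, width=1024,
                          fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z) LiDAR points into a height x width range image.

    Hypothetical helper: the sensor parameters are illustrative assumptions.
    Returns a 2D list of ranges, with None where no point lands. When several
    points fall into the same pixel, the nearest one is kept.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down  # total vertical field of view

    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip degenerate points at the sensor origin

        yaw = math.atan2(y, x)    # azimuth in [-pi, pi]
        pitch = math.asin(z / r)  # elevation angle

        # Map azimuth to columns and elevation to rows (top row = fov_up).
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height

        # Clamp to valid pixel indices; points outside the FOV land on the border.
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))

        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image


if __name__ == "__main__":
    cloud = [(10.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
    rv = range_view_projection(cloud, height=4, width=8)
    for row in rv:
        print(row)
```

Camera features projected into the same grid share these pixel coordinates, which is what allows a fusion module to compare and weight the two modalities per pixel. Because several 3D points can collapse onto one pixel, the projection is lossy, which is the kind of spatial ambiguity the abstract says the hybrid 2D-3D decoder is designed to mitigate.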
