DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception

Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool

TL;DR

DGFusion tackles robust multimodal semantic perception for autonomous driving by introducing depth-guided fusion that conditions cross-modal attention on local depth cues while maintaining a global environmental condition token. It reframes fusion as a multi-task problem by adding an auxiliary, lidar-supervised depth head, enabling depth-informed features to guide region-level sensor weighting without extra inference cost. The approach yields state-of-the-art results on MUSES and DeLiVER, particularly under adverse weather and lighting, and ablations highlight the complementary benefits of local depth tokens and the global condition cue. The method remains computationally efficient, with only a small parameter increase and practical FPS, supporting real-world deployment.

Abstract

Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DeLiVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion
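The abstract does not spell out the form of the robust depth loss. As a generic stand-in, and explicitly not the authors' loss, the sketch below shows one common way to supervise a dense depth prediction with sparse lidar: a Huber penalty averaged over only those pixels that have a lidar return. The function name `masked_huber_depth_loss`, the convention that `0` marks a missing return, and the `delta` threshold are all assumptions made for illustration.

```python
import numpy as np

def masked_huber_depth_loss(pred, lidar_depth, delta=1.0):
    """Huber loss between a dense depth prediction and sparse lidar depth.

    Assumption: lidar_depth == 0 marks pixels without a lidar return,
    so those pixels contribute nothing to the loss.
    """
    valid = lidar_depth > 0
    if not valid.any():
        # No supervision available for this sample.
        return 0.0
    err = np.abs(pred[valid] - lidar_depth[valid])
    quadratic = 0.5 * err ** 2               # small errors: L2-like
    linear = delta * (err - 0.5 * delta)     # large errors: L1-like, less outlier-sensitive
    return float(np.where(err <= delta, quadratic, linear).mean())

# Two of four pixels carry a lidar return; one of them is a large outlier.
pred = np.array([[1.0, 5.0], [2.0, 3.0]])
lidar = np.array([[1.5, 0.0], [0.0, 6.0]])
loss = masked_huber_depth_loss(pred, lidar)
```

The linear branch limits the influence of large residuals, which is the usual motivation for a robust loss when the supervision signal is noisy.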

Paper Structure

This paper contains 11 sections, 17 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Intuition on DGFusion. Unlike previous sensor fusion works that use lidar only as an input, we additionally utilize this readily available modality for depth supervision to create a multi-tasking setup, hinging on the well-known benefits of depth estimation for semantic perception.
  • Figure 2: DGFusion overview. We process all input modalities with a shared backbone with individual feature adapters. Their outputs are split into three branches: the depth estimation branch at the top, the segmentation branch in the middle, and the condition representation branch at the bottom. In our multi-task setup, the sparse and noisy lidar serves as supervision for the auxiliary depth head, enabling the network to learn depth-informed features that improve semantic representations. Both the depth and the condition branch send additional features into the Depth-Guided Fusion modules. In these modules, features from the RGB input, the respective secondary modality, and the depth are divided into local windows. Each depth window is processed to extract a local Depth Token (DT), which is concatenated with the RGB tokens and the Condition Token (CT) to form the set of queries for cross-attention. After fusion, the DT and CT are removed, and the windows are reassembled, yielding enriched features that are fed to the segmentation head to produce the final segmentation prediction.
  • Figure 3: Qualitative comparison on MUSES with visualization of the input modalities. Best viewed on a screen at full zoom.
  • Figure 4: Qualitative ablation of our loss design on the MUSES dataset. Best viewed on a screen at full zoom.
  • Figure 5: Further qualitative results on MUSES. Best viewed on a screen at full zoom.
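
The token bookkeeping described in the Figure 2 caption (window partitioning, one Depth Token per window, concatenation with the RGB tokens and the Condition Token as queries, removal of both tokens after fusion, and window reassembly) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the caption does not confirm: the Depth Token is obtained by mean pooling over the depth window, keys and values come from the secondary modality, a single attention head with a residual connection is used, and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(feat, win):
    """Split an (H, W, C) map into (num_windows, win*win, C) token sets."""
    H, W, C = feat.shape
    x = feat.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_merge(windows, H, W, win):
    """Inverse of window_partition: reassemble windows into (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def depth_guided_fusion(rgb, sec, depth, cond_token, win, Wq, Wk, Wv):
    """Single-head windowed cross-attention with extra DT and CT queries."""
    H, W, C = rgb.shape
    rgb_w = window_partition(rgb, win)        # (N, T, C)
    sec_w = window_partition(sec, win)        # (N, T, C)
    depth_w = window_partition(depth, win)    # (N, T, C)
    N, T, _ = rgb_w.shape

    # Assumption: one Depth Token per window via mean pooling.
    dt = depth_w.mean(axis=1, keepdims=True)          # (N, 1, C)
    # The same global Condition Token is shared by every window.
    ct = np.broadcast_to(cond_token, (N, 1, C))       # (N, 1, C)

    # Queries: RGB tokens + DT + CT. Keys/values: secondary modality (assumed).
    queries = np.concatenate([rgb_w, dt, ct], axis=1)  # (N, T+2, C)
    q, k, v = queries @ Wq, sec_w @ Wk, sec_w @ Wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C), axis=-1)
    fused = queries + attn @ v                         # residual connection

    # Drop the DT and CT rows, then reassemble the windows.
    return window_merge(fused[:, :T], H, W, win)

rng = np.random.default_rng(0)
H, W, C, win = 8, 8, 4, 4
rgb, sec, depth = (rng.normal(size=(H, W, C)) for _ in range(3))
cond_token = rng.normal(size=C)
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))
out = depth_guided_fusion(rgb, sec, depth, cond_token, win, Wq, Wk, Wv)
```

In a single cross-attention layer each query row is computed independently, so this sketch only illustrates how the tokens are added and removed. It does not show how the DT and CT influence the RGB tokens, which the caption does not specify.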