CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes

Tim Broedermann, Christos Sakaridis, Yuqian Fu, Luc Van Gool

TL;DR

CAFuser targets robust semantic perception for autonomous driving under adverse conditions with a condition-aware multimodal fusion framework. A single shared backbone with modality-specific feature adapters maps RGB camera, lidar, radar, and event camera inputs into a common latent space. A Condition Token (CT), derived from RGB features and trained with a verbo-visual contrastive loss, steers the fusion through attention. The framework supports multiple fusion strategies; the best-performing one, CA$^2$ fusion, injects the CT into cross-attention within local windows so that the fusion adapts dynamically to the current condition. On MUSES and DeLiVER, CAFuser reaches state-of-the-art results for both panoptic and semantic segmentation, while sharing one backbone across modalities via the adapters cuts parameters by about 54%. The approach is robust and scalable, offering a practical path toward more reliable multimodal perception across diverse operational design domains (ODDs) and sensor configurations.
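To make the idea of supervising the Condition Token with a verbo-visual contrastive loss more concrete, here is a minimal NumPy sketch. It assumes a CLIP-style formulation (cosine similarity between each CT and every encoded condition prompt, followed by cross-entropy); the exact loss used by CAFuser is not spelled out in this summary, so the function name `condition_contrastive_loss`, the `temperature` value, and the random stand-ins for the encoded prompts are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def condition_contrastive_loss(condition_tokens, text_embeddings, labels, temperature=0.07):
    """CLIP-style loss (an assumption, not necessarily the paper's exact loss).

    condition_tokens: (B, D) one Condition Token per image
    text_embeddings:  (K, D) one embedding per condition prompt
    labels:           (B,)   index of the true condition prompt per image
    """
    ct = l2_normalize(condition_tokens)
    txt = l2_normalize(text_embeddings)
    logits = ct @ txt.T / temperature                    # (B, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy check with random stand-ins for the encoded prompts (hypothetical data).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))       # 4 condition prompts, 16-dim embeddings
labels = np.array([0, 1, 2, 3])
tokens = text[labels]                 # CTs that coincide with their prompts
loss_aligned = condition_contrastive_loss(tokens, text, labels)
loss_shuffled = condition_contrastive_loss(tokens, text, np.roll(labels, 1))
print(loss_aligned, loss_shuffled)
```

In this toy setup the loss is lower when each token is paired with its own prompt than when the labels are shifted, which is the behavior that pulls the CT toward the text embedding of the true condition.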

Abstract

Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality-specific feature adapters to align diverse sensor inputs into a shared latent space, enabling efficient integration with a single shared pre-trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. CAFuser ranks first on the public MUSES benchmarks, achieving 59.7 PQ for multimodal panoptic and 78.2 mIoU for semantic segmentation, and also sets the new state of the art on DeLiVER. The source code is publicly available at: https://github.com/timbroed/CAFuser.

Paper Structure

This paper contains 11 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: CAFuser overview. We encode the weather and lighting conditions in a Condition Token and guide the condition-aware fusion with it.
  • Figure 2: Our proposed CAFuser architecture with RGB camera, lidar, radar, and event camera as input modalities. Each input is passed through the shared backbone and an individual feature adapter. The CT is generated from the highest-level RGB feature map, supervised with a verbo-visual contrastive loss to our encoded condition prompts ($Q_{text}$), and used to guide the condition-aware fusion (CAF). The resulting fused multi-scale feature maps are then passed to the pixel decoder and the OneFormer [jain2023oneformer] head to produce the prediction.
  • Figure 3: Condition-Aware Fusion (CAF) for our CA$^2$ variant. We apply multi-window cross-attention [broedermann2023hrfuser] by splitting each modality's feature map into local windows, fusing all secondary modalities in parallel with the RGB features by using our proposed condition-aware cross-attention (CA$^2$) module, and finally stitching the local windows back together.
  • Figure 4: Condition-Aware Cross-Attention (CA$^2$) applied to each local window, here illustrated for the case of RGB-lidar fusion. The Condition Token (CT) is passed through a fully connected layer to align with the feature dimension and concatenated with the RGB tokens to generate a condition-aware query for cross-attention. Afterwards, we remove the token corresponding to the CT to maintain the original spatial dimensions before reassembling the full feature map.
  • Figure 5: Average Condition-Aware Addition (CAA) fusion weights in % on the MUSES test set across different weather conditions and times of day. This figure illustrates how the relative contributions of each sensor modality vary under various environmental conditions, highlighting the adaptability of the fusion mechanism to changing visibility and lighting scenarios.
  • ...and 1 more figures
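
The split-fuse-stitch flow described in the Figure 3 caption can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper names (`split_into_windows`, `stitch_windows`, `fuse_per_window`) are hypothetical, feature maps are taken as `(H, W, C)` arrays whose height and width are multiples of the window size, and `additive_placeholder` is a simple stand-in for the per-window fusion step, not the CA$^2$ module itself.

```python
import numpy as np

def split_into_windows(feat, win):
    """(H, W, C) -> (num_windows, win*win, C); H and W must be multiples of win."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "feature map must tile evenly"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/win, W/win, win, win, C)
    return x.reshape(-1, win * win, C)

def stitch_windows(windows, H, W, win):
    """Inverse of split_into_windows: (num_windows, win*win, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/win, win, W/win, win, C)
    return x.reshape(H, W, C)

def fuse_per_window(rgb, secondaries, win, fuse_fn):
    """Apply fuse_fn to matching windows of RGB and all secondary modalities."""
    H, W, _ = rgb.shape
    rgb_w = split_into_windows(rgb, win)
    sec_w = [split_into_windows(s, win) for s in secondaries]
    fused = np.stack([
        fuse_fn(rgb_w[i], [s[i] for s in sec_w]) for i in range(len(rgb_w))
    ])
    return stitch_windows(fused, H, W, win)

def additive_placeholder(rgb_tokens, secondary_tokens):
    """Stand-in fusion (NOT the CA^2 module): add all secondary tokens to RGB."""
    return rgb_tokens + sum(secondary_tokens)

# Toy feature maps for three modalities (random, hypothetical data).
rng = np.random.default_rng(0)
rgb, lidar, radar = (rng.normal(size=(8, 12, 4)) for _ in range(3))
roundtrip = stitch_windows(split_into_windows(rgb, 4), 8, 12, 4)
fused = fuse_per_window(rgb, [lidar, radar], 4, additive_placeholder)
print(np.array_equal(roundtrip, rgb), fused.shape)
```

Because the placeholder acts on each token independently, its windowed result matches plain element-wise addition; a window-level attention module would instead mix information among the tokens inside each window, which is where the window size starts to matter.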