CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes

Tim Broedermann; Christos Sakaridis; Yuqian Fu; Luc Van Gool

TL;DR

CAFuser targets robust semantic perception for autonomous driving under adverse conditions with a condition-aware multimodal fusion framework. A single shared backbone with modality-specific adapters maps RGB, LiDAR, radar, and event-camera inputs into a common latent space. A Condition Token (CT), derived from RGB features and trained with a verbo-visual contrastive loss, steers the fusion through attention. The framework supports multiple fusion strategies; CA^2 fusion performs best by injecting the CT into cross-attention within local windows, which makes the fusion dynamic and condition-driven. On MUSES and DeLiVER, CAFuser achieves state-of-the-art results for both panoptic and semantic segmentation, with a parameter reduction of about 54% attributed to the adapters. The approach is robust and scalable, offering a practical path toward more reliable multimodal perception across diverse operational design domains (ODDs) and sensor configurations.
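To make the idea of condition-driven fusion concrete, the following is a minimal sketch, not the authors' CA^2 implementation. It assumes one feature vector per modality at a single location, uses the Condition Token directly as the attention query, and omits learned projections and local windows. The function name `condition_guided_fusion` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def condition_guided_fusion(condition_token, modality_features):
    """Fuse per-modality features with weights derived from a condition token.

    Assumption (not from the paper): the condition token acts as the query,
    and each modality's feature vector serves as both key and value.
    Returns the fused vector and the per-modality attention weights.
    """
    names = list(modality_features)
    dim = len(condition_token)
    # Scaled dot-product score between the condition token and each modality.
    scores = [
        dot(condition_token, modality_features[name]) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality features, one output value per channel.
    fused = [
        sum(w * modality_features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy example: the condition token aligns more strongly with the LiDAR feature.
fused, weights = condition_guided_fusion(
    [1.0, 0.0],
    {"rgb": [0.1, 0.0], "lidar": [2.0, 0.0]},
)
```

In this toy setting, the modality whose features align best with the token receives the largest weight, which is the intuition behind letting the detected condition decide how much each sensor contributes.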

Abstract

Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality-specific feature adapters to align diverse sensor inputs into a shared latent space, enabling efficient integration with a single, shared pre-trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. CAFuser ranks first on the public MUSES benchmarks, achieving 59.7 PQ for multimodal panoptic and 78.2 mIoU for semantic segmentation, and also sets the new state of the art on DeLiVER. The source code is publicly available at: https://github.com/timbroed/CAFuser.
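The TL;DR states that the Condition Token is trained with a verbo-visual contrastive loss. The sketch below shows one common form such a loss can take: an InfoNCE-style cross-entropy over cosine similarities between condition tokens and text embeddings of condition descriptions. This is an assumption for illustration, not the loss defined in the paper. The function name `contrastive_loss`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def contrastive_loss(condition_tokens, text_embeddings, temperature=0.07):
    """Average cross-entropy of matching token i to text embedding i.

    Assumption (not from the paper): pairs are aligned by index, and only
    the token-to-text direction is scored.
    """
    total = 0.0
    for i, token in enumerate(condition_tokens):
        logits = [cosine(token, text) / temperature for text in text_embeddings]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(condition_tokens)


# Toy embeddings: two conditions with orthogonal text descriptions.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], texts)
swapped_loss = contrastive_loss([[0.0, 1.0], [1.0, 0.0]], texts)
```

The loss is small when each token points toward its own description and large when the pairing is swapped, which is the behavior a contrastive objective of this kind rewards.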
