DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Abstract

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) remain underexplored as a means of leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. It covers four visual modalities, five sensor failure cases, and five weather conditions, and includes $102{,}505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.

Paper Structure (28 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Left: Two corner cases of adverse driving scenes (foggy condition causing poor visibility and camera over-exposure resulting in degraded image quality). Right: Performance (GPTScore) comparison on DriveXQA dataset.
  • Figure 2: Statistics of DriveXQA dataset showing distribution of weather conditions (left) and distribution of sensor failure types (right): MB (Motion Blur), OE (Overexposure), UE (Underexposure), LJ (LiDAR Jitter), EL (Event Low-resolution).
  • Figure 3: Hierarchical XQA examples on DriveXQA dataset. The framework demonstrates three semantic levels: Global Scene Level, Allocentric Level, and Ego-Vehicle Centric Level.
  • Figure 4: Overview of MVX-LLM. The framework processes multi-modal sensor inputs (RGB, Depth, Event cameras from four viewpoints, and LiDAR point clouds) through specialized encoders. The DCA mechanism integrates RGB, Depth, and Event features before token replacement. The Question Answering component utilizes the fused representations to generate hierarchical responses across Global Scene, Allocentric, and Ego-Vehicle Centric levels under adverse driving conditions.
  • Figure 5: Qualitative analysis of multi-modal fusion performance under night conditions with camera overexposure.