DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao; Ruiping Liu; Junwei Zheng; Yufan Chen; Kedi Ying; M. Saquib Sarfraz; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen

Abstract

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, the use of Multimodal Large Language Models (MLLMs) to leverage multi-sensor information for understanding adverse driving scenarios in autonomous vehicles remains underexplored. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. It covers four visual modalities, five sensor failure cases, and five weather conditions, and includes $102,505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.

Paper Structure (28 sections, 13 equations, 5 figures, 4 tables)

Introduction
Related Work
Autonomous Driving VQA Datasets and Benchmarks
Multi-Modal Large Language Models for Sensor Fusion
DriveXQA Dataset
Dataset Generation
Hierarchical XQA
Global Scene Level
Allocentric Level
Ego-Vehicle Centric Level
MVX-LLM Framework
Overall Architecture
Dual Cross-Attention Mechanism
Token Aggregation
Global Average Pooling (GAP)
...and 13 more sections

Figures (5)

Figure 1: Left: Two corner cases of adverse driving scenes (foggy condition causing poor visibility and camera over-exposure resulting in degraded image quality). Right: Performance (GPTScore) comparison on DriveXQA dataset.
Figure 2: Statistics of DriveXQA dataset showing distribution of weather conditions (left) and distribution of sensor failure types (right): MB (Motion Blur), OE (Overexposure), UE (Underexposure), LJ (LiDAR Jitter), EL (Event Low-resolution).
Figure 3: Hierarchical XQA examples on DriveXQA dataset. The framework demonstrates three semantic levels: Global Scene Level, Allocentric Level, and Ego-Vehicle Centric Level.
Figure 4: Overview of MVX-LLM. The framework processes multi-modal sensor inputs (RGB, Depth, Event cameras from four viewpoints, and LiDAR point clouds) through specialized encoders. The DCA mechanism integrates RGB, Depth, and Event features before token replacement. The Question Answering component utilizes the fused representations to generate hierarchical responses across Global Scene, Allocentric, and Ego-Vehicle Centric levels under adverse driving conditions.
Figure 5: Qualitative analysis of multi-modal fusion performance under night conditions with camera overexposure.
