World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving
Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia
TL;DR
This work tackles perception-limited regions in autonomous driving by introducing a plug-and-play instruction-guided interactor that pre-fuses multi-view visual features with world knowledge through top-$k$ token selection and cross-attention, enabling efficient reasoning with a large language model. A three-stage training pipeline aligns single-view and multi-view visual-language representations before task-specific instruction tuning, while a large-scale data framework (2M QA, 1.7M grounding) and a 200K object-level risk assessment QA dataset support robust evaluation of reasoning under occlusions. Empirical results across NuScenes-MQA, OmniDrive-NuScenes, NuInstruct, and ORA demonstrate state-of-the-art gains in reasoning, grounding, and planning tasks, with ablations confirming the pivotal roles of the interactor and top-$k$ selection. The study highlights practical gains in open-loop planning and provides a comprehensive dataset and analysis that advance world-knowledge–driven autonomous driving under perception constraints, with future work addressing closed-loop settings and 3D grounding challenges.
Abstract
Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, in perception-limited areas (regions under dynamic or static occlusion), MLLMs struggle to integrate perception with world knowledge for reasoning. These regions can conceal crucial safety information, especially about vulnerable road users. In this paper, we propose a framework that improves autonomous driving performance under perception-limited conditions by strengthening the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we collect and refine a large-scale multi-modal dataset that includes 2 million natural language QA pairs and 1.7 million grounding task samples. To evaluate the model's use of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, whose questions require multi-step reasoning that draws on world knowledge. Extensive experiments validate the effectiveness of the proposed method.
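To make the sequence-length reduction concrete, the following is a minimal sketch of how instruction-guided top-k token selection followed by cross-attention might look. It is an illustration under stated assumptions, not the paper's implementation: the function name `interact`, the relevance score (dot product with the mean-pooled instruction embedding), the direction of the cross-attention (kept visual tokens attend to instruction tokens), the residual connection, and the absence of learned projections are all hypothetical choices made for brevity.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def interact(visual_tokens, instruction_tokens, k):
    """Keep the k visual tokens most relevant to the instruction, then fuse.

    visual_tokens: N vectors of size d (all camera views concatenated).
    instruction_tokens: M vectors of size d.
    Returns (fused, kept_indices), where fused holds k vectors of size d.
    """
    d = len(instruction_tokens[0])
    scale = math.sqrt(d)

    # Assumption: relevance is the scaled dot product with the
    # mean-pooled instruction embedding.
    pooled = [sum(col) / len(instruction_tokens) for col in zip(*instruction_tokens)]
    scores = [dot(v, pooled) / scale for v in visual_tokens]

    # Top-k selection; indices are re-sorted to preserve the original token order.
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    kept = sorted(ranked[:k])

    # Single-head cross-attention without learned projections (assumption):
    # each kept visual token queries the instruction tokens.
    fused = []
    for i in kept:
        v = visual_tokens[i]
        weights = softmax([dot(v, t) / scale for t in instruction_tokens])
        context = [sum(w * t[j] for w, t in zip(weights, instruction_tokens)) for j in range(d)]
        fused.append([a + b for a, b in zip(v, context)])  # residual connection
    return fused, kept


# Toy usage: 6 views x 16 tokens each, reduced to 12 fused tokens.
random.seed(0)
dim = 8
visual = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6 * 16)]
instruction = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
fused_tokens, kept_idx = interact(visual, instruction, k=12)
print(len(visual), "visual tokens reduced to", len(fused_tokens))
```

In this sketch the language model would receive only the k fused tokens rather than all N multi-view tokens, which is the sense in which such a module can shorten the input sequence; the value of k and the toy dimensions here are arbitrary.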
