MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, Shuangping Huang
TL;DR
MPDrive tackles the core challenge of spatial understanding in AD-VQA by substituting text-based spatial coordinates with concise visual markers annotated on detected objects. It introduces two modules, Marker ControlNet (MCNet) and Perception-Enhanced Spatial Prompt Learning (PSPL), to fuse original image features with marker-derived cues and to generate scene- and instance-level prompts that guide a Large Language Model in spatial reasoning. The approach yields state-of-the-art results on DriveLM and CODA-LM, with marked gains in both spatial perception metrics (e.g., match) and language-based metrics (e.g., BLEU-4, METEOR), across multi-view and single-view tasks. These findings demonstrate a practical path to aligning visual spatial representations with linguistic descriptions, potentially improving safety-critical decision-making in autonomous driving, though performance remains tied to the quality of the detection expert and long-horizon temporal reasoning remains a challenge. Key contributions include the MPDrive framework, the MCNet fusion mechanism with a zero-initialized fusion layer and LoRA-based training, and the PSPL module delivering both scene- and instance-level prompts to the LLM for improved spatial understanding.
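The zero-initialized fusion layer mentioned above can be illustrated with a minimal sketch. It assumes a residual form (original features plus a linear projection of marker features), which this summary does not spell out; the function names `zero_linear`, `apply_linear`, and `fuse` are hypothetical and not taken from the paper. The point is only that, with all-zero weights, the fused output starts out identical to the original image features, so marker cues are introduced gradually during training.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def zero_linear(dim: int) -> Tuple[Matrix, Vector]:
    """Create a dim x dim linear layer whose weights and bias are all zero."""
    weight = [[0.0] * dim for _ in range(dim)]
    bias = [0.0] * dim
    return weight, bias


def apply_linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute weight @ x + bias for a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse(original: Vector, marker: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Assumed residual fusion: original features plus projected marker features."""
    delta = apply_linear(weight, bias, marker)
    return [o + d for o, d in zip(original, delta)]


# At initialization the marker branch contributes nothing.
original_feat = [0.5, -1.0, 2.0]
marker_feat = [3.0, 4.0, 5.0]
weight, bias = zero_linear(3)
fused_at_init = fuse(original_feat, marker_feat, weight, bias)
```

Once training moves the weights away from zero, the marker branch begins to alter the fused features; for example, identity weights would simply add the marker features to the original ones.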
Abstract
Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial understanding capabilities. Prior works typically express spatial information through textual representations of coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose a novel Marker-based Prompt learning framework (MPDrive), which represents spatial coordinates by concise visual markers, ensuring linguistic expressive consistency and enhancing the accuracy of both visual perception and spatial expression in AD-VQA. Specifically, we create marker images by employing a detection expert to overlay object regions with numerical labels, converting complex textual coordinate generation into straightforward text-based visual marker predictions. Moreover, we fuse original and marker images as scene-level features and integrate them with detection priors to derive instance-level features. By combining these features, we construct dual-granularity visual prompts that stimulate the LLM's spatial perception capabilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance, particularly in cases requiring sophisticated spatial understanding.
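The marker idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Box` type, the choice of the box center as the marker location, and the `<k>` answer syntax are all hypothetical. The sketch shows how numbering detected objects lets a model answer with a short marker index, which is then mapped back to pixel coordinates, instead of generating the coordinates as text.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical detector output)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> Point:
        return ((self.x_min + self.x_max) / 2.0, (self.y_min + self.y_max) / 2.0)


def assign_markers(boxes: List[Box]) -> Dict[int, Point]:
    """Give each detected box a numeric marker ID (1, 2, ...) and record its center."""
    return {marker_id: box.center() for marker_id, box in enumerate(boxes, start=1)}


# Assumed answer syntax for a marker reference, e.g. "<2>".
MARKER_PATTERN = re.compile(r"<(\d+)>")


def resolve_markers(answer: str, markers: Dict[int, Point]) -> List[Point]:
    """Map marker IDs mentioned in a text answer back to pixel coordinates.

    IDs that do not correspond to any detected object are skipped.
    """
    ids = [int(match) for match in MARKER_PATTERN.findall(answer)]
    return [markers[i] for i in ids if i in markers]


# Usage: two detections become markers 1 and 2; the answer refers to them by index.
detections = [Box(0, 0, 10, 20), Box(100, 50, 200, 150)]
markers = assign_markers(detections)
points = resolve_markers("The vehicle at <2> is ahead of <1>.", markers)
```

A consequence visible in the sketch is that an object the detector misses has no marker and cannot be referenced, which is consistent with the TL;DR's note that performance depends on the quality of the detection expert.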
