MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, Shuangping Huang

TL;DR

MPDrive tackles the core challenge of spatial understanding in AD-VQA by substituting text-based spatial coordinates with concise visual markers annotated on detected objects. It introduces two modules, Marker ControlNet (MCNet) and Perception-Enhanced Spatial Prompt Learning (PSPL), to fuse original image features with marker-derived cues and to generate scene- and instance-level prompts that guide a Large Language Model in spatial reasoning. The approach yields state-of-the-art results on DriveLM and CODA-LM, with marked gains in both spatial perception metrics (e.g., match) and language-based metrics (e.g., BLEU-4, METEOR), across multi-view and single-view tasks. These findings demonstrate a practical path to aligning visual spatial representations with linguistic descriptions, potentially improving safety-critical decision-making in autonomous driving, though performance remains tied to the quality of the detection expert and long-horizon temporal reasoning remains a challenge. Key contributions include the MPDrive framework, the MCNet fusion mechanism with a zero-initialized fusion layer and LoRA-based training, and the PSPL module delivering both scene- and instance-level prompts to the LLM for improved spatial understanding.

Abstract

Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial understanding capabilities. Prior works typically express spatial information through textual representations of coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose a novel Marker-based Prompt learning framework (MPDrive), which represents spatial coordinates by concise visual markers, ensuring linguistic expressive consistency and enhancing the accuracy of both visual perception and spatial expression in AD-VQA. Specifically, we create marker images by employing a detection expert to overlay object regions with numerical labels, converting complex textual coordinate generation into straightforward text-based visual marker predictions. Moreover, we fuse original and marker images as scene-level features and integrate them with detection priors to derive instance-level features. By combining these features, we construct dual-granularity visual prompts that stimulate the LLM's spatial perception capabilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance, particularly in cases requiring sophisticated spatial understanding.
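
To make the marker idea concrete, the following is a minimal sketch of how a predicted marker index could be mapped back to image coordinates. It is not the authors' implementation: the `<k>` answer syntax, the `markers` dictionary layout, and the choice of the box center as the returned coordinate are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

# Hypothetical detection-expert output: marker index -> bounding box.
markers: Dict[int, Box] = {
    1: (100.0, 200.0, 180.0, 320.0),
    2: (400.0, 150.0, 520.0, 260.0),
}


def marker_to_center(answer: str, markers: Dict[int, Box]) -> Optional[Tuple[float, float]]:
    """Return the box center for the first marker index mentioned in the answer."""
    match = re.search(r"<(\d+)>", answer)  # assumed "<k>" marker syntax
    if match is None:
        return None  # the answer does not reference any marker
    box = markers.get(int(match.group(1)))
    if box is None:
        return None  # the answer names a marker the detector never drew
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
```

Under these assumptions the language model only has to emit a short index, and the numeric coordinates come from a lookup into the detector's output instead of being generated token by token.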

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison of object response process between mainstream MLLMs (red box) and our proposed MPDrive (green box). Current research directly represents object spatial coordinates in text format, leading to semantic gaps between coordinates and text descriptions. This misalignment adversely impacts subsequent prediction and planning tasks. In contrast, MPDrive converts complex spatial coordinate generation into text-based visual marker (region with numerical label) predictions, ensuring linguistic consistency.
  • Figure 2: Overview of the MPDrive framework. For clarity, we illustrate the process using a single-view image. The detection expert generates a visual marker image $I_m$. This marker image $I_m$ and the original image $I$ are processed by MCNet to extract scene-level features $y_s$. For the Perception-Enhanced Spatial Prompt Learning module, these scene-level features $y_s$ undergo mask average pooling for each instance mask to obtain instance-level features $y_i$. Subsequently, both scene-level features $y_s$ and instance-level features $y_i$ are processed through a connected MLP to generate visual prompts $T_s$ and $T_i$ respectively. Finally, these visual prompts, combined with text embeddings, are fed into the Large Language Model to generate the output $\hat{s}$. For coordinate prediction, MPDrive predicts the marker index $k$ corresponding to the target object and then converts it into the respective coordinates.
  • Figure 3: Comparison of the responses between InternVL-2 and our proposed MPDrive. The yellow area and dots represent the response and coordinates of the ground truth (GT), the green area and dots indicate the response and coordinates of MPDrive, and the red area and dots denote the response and coordinates of InternVL-2. The blue box indicates the image that is most relevant to the response, with an enlarged version of this image located in the bottom right corner of each sample. The orange dots ($\bullet$) represent the positions of the coordinates in the image related to the question.
  • Figure 4: Visual prompt activation examples between InternVL-2 and our proposed MPDrive.
  • Figure 5: Comparison of different components of MPDrive on the responses. The yellow area and dots represent the response and coordinates of the ground truth (GT), the brown area and dots indicate the response and coordinates after adding the Visual Marker, the red area and dots denote the response and coordinates after adding the Visual Marker and the MCNet, and the green area and dots indicate the response and coordinates of MPDrive.
  • ...and 1 more figure
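
The Figure 2 caption mentions mask average pooling, which turns scene-level features $y_s$ into one instance-level feature $y_i$ per instance mask. The sketch below shows the general operation in plain Python, assuming a feature map stored as nested lists of shape H x W x C and a binary mask of shape H x W. The function name, the data layout, and the zero vector returned for an empty mask are illustrative assumptions, not details taken from the paper.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x C
Mask = List[List[int]]                # H x W, entries 0 or 1


def mask_average_pool(features: FeatureMap, mask: Mask) -> List[float]:
    """Average the C-dimensional feature vectors at positions where mask == 1."""
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # assumed convention: an empty mask yields a zero vector
    return [value / count for value in total]


# Toy 2 x 2 feature map with 2 channels and a mask selecting the diagonal.
scene_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
instance_mask = [
    [1, 0],
    [0, 1],
]
instance_feature = mask_average_pool(scene_features, instance_mask)
```

Here `instance_feature` averages the two selected positions, `[1.0, 2.0]` and `[7.0, 8.0]`, giving `[4.0, 5.0]`. In a real pipeline the same reduction would typically run on tensors, once per detected instance.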