Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang

TL;DR

Prune2Drive is a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving (AD). It combines a diversity-aware token selection mechanism, which prioritizes semantic and spatial coverage across views, with a view-adaptive pruning controller, which automatically learns optimal pruning ratios based on each camera's importance to downstream tasks.

Abstract

Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), providing a unified framework for perception and decision-making. However, their real-world deployment is hindered by significant computational overhead when processing high-resolution, multi-view images. This complexity stems from the massive number of visual tokens, which increases inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in AD. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism that prioritizes semantic and spatial coverage across views, and (ii) a view-adaptive pruning controller that automatically learns optimal pruning ratios based on camera importance to downstream tasks. Unlike prior methods, Prune2Drive requires no model retraining or access to attention maps, ensuring compatibility with modern efficient attention implementations. Extensive experiments on the DriveLM and DriveLMM-o1 benchmarks demonstrate that Prune2Drive achieves significant speedups and memory savings with minimal performance impact. When retaining only 10% of visual tokens, our method achieves a 6.40x speedup in the prefilling phase and consumes only 13.4% of the original FLOPs, with a mere 3% average performance drop on the DriveLM benchmark. Code is available at: https://github.com/MinhaoXiong/Prune2Drive.git
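The two ideas in the abstract, diversity-aware selection and per-view pruning ratios, can be illustrated with a small sketch. This is not the authors' implementation: it uses plain greedy farthest point sampling with cosine distance as a stand-in for the paper's selection mechanism, and the per-view keep ratios are set by hand rather than learned by a controller. The function names (`cosine_distance`, `farthest_point_sample`, `prune_views`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def farthest_point_sample(tokens, keep, start=0):
    """Greedily pick up to `keep` token indices that are mutually far apart."""
    n = len(tokens)
    if n == 0:
        return []
    keep = max(1, min(keep, n))
    selected = [start]
    chosen = {start}
    # min_dist[i] is the distance from token i to its nearest selected token.
    min_dist = [cosine_distance(t, tokens[start]) for t in tokens]
    while len(selected) < keep:
        # Take the unselected token that is farthest from the current selection.
        nxt = max((i for i in range(n) if i not in chosen),
                  key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i],
                              cosine_distance(tokens[i], tokens[nxt]))
    return sorted(selected)


def prune_views(view_tokens, keep_ratios):
    """Apply a per-view keep ratio, then sample each view independently."""
    kept = {}
    for view, tokens in view_tokens.items():
        budget = max(1, round(keep_ratios[view] * len(tokens)))
        kept[view] = farthest_point_sample(tokens, budget)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for visual token embeddings: 40 tokens of dimension 8 per view.
    views = {
        "front": [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)],
        "back": [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)],
    }
    # Hypothetical hand-set ratios; the paper learns these per view.
    ratios = {"front": 0.2, "back": 0.05}
    kept = prune_views(views, ratios)
    print({view: len(indices) for view, indices in kept.items()})
```

The sketch never reads attention scores, which matches the abstract's claim that the method needs no access to attention maps. Selection depends only on the token embeddings, so in principle it can run before the language model regardless of the attention kernel in use.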

Paper Structure

This paper contains 30 sections, 1 theorem, 5 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumption assump:view_lipschitz, our method achieves a provably tighter error bound than a baseline using uniform ratios and random sampling for the same total budget $K_{total}$:

Figures (6)

  • Figure 1: Comparison of the front-view and back-view visual token selection between Prune2Drive and FastV. The green box highlights important objects captured by Prune2Drive, the red text wrongly describes the scene, and the green text matches the scene.
  • Figure 2: Prune2Drive achieves SOTA on both the DriveLMM-o1 and DriveLM benchmarks.
  • Figure 3: Detailed architecture of Prune2Drive. (a) VLM workflow in Prune2Drive, (b) View-adaptive pruning ratio optimization, where view-specific token pruning ratios are automatically determined, and (c) Diversity-aware T-FPS token pruning strategy, which preserves visual tokens that contain rich semantic and spatial information across multi-view inputs.
  • Figure 4: Quantitative results of selected visual tokens. We compare the visual tokens selected by FastV, DART, and Prune2Drive. The red box indicates the position bias of attention-based token-pruning methods, where posterior tokens are retained, and the green bounding boxes highlight critical objects captured by Prune2Drive, which enables view-importance assignment and diversity-aware token selection.
  • Figure 5: Quantitative results of selected visual tokens. We compare selected visual tokens by FastV, DART, and Prune2Drive. FastV shows position bias (red boxes), retaining mostly posterior tokens, DART neglects view importance, while our Prune2Drive (green boxes) captures critical objects through view-importance and diversity-aware selection.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1: Optimality of View-Adaptive Diversity Pruning