VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Ruifei Zhang, Wei Zhang, Xiao Tan, Sibei Yang, Xiang Wan, Xiaonan Luo, Guanbin Li

TL;DR

VLDrive targets two obstacles to deploying language-grounded autonomous driving: weak visual representations, which failure analysis links to frequent collisions and obstructions, and the large parameter counts of LLMs. It introduces a vision-augmented lightweight MLLM architecture with three components: cycle-consistent dynamic visual pruning (CCDP) to select salient visual tokens, memory-enhanced feature aggregation (MEFA) to exploit temporal cues, and distance-decoupled instruction attention (DDIA) to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. A training-only token reconstruction task further ensures that the retained tokens preserve visual information. In closed-loop evaluation on the CARLA LangAuto benchmarks, VLDrive achieves state-of-the-art driving performance with 81% fewer parameters (1.3B versus 7B).

Abstract

Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations. Code is available at https://github.com/ReaFly/VLDrive.
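
The 81% figure quoted in the abstract follows directly from the two parameter counts it gives (7B and 1.3B). A minimal check is sketched below; the function and variable names are illustrative and do not come from the paper or its code release.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Parameter counts in billions, as stated in the abstract.
baseline_params_b = 7.0
vldrive_params_b = 1.3

reduction_pct = 100 * relative_reduction(baseline_params_b, vldrive_params_b)
print(f"{reduction_pct:.1f}% fewer parameters")  # about 81%, matching the abstract
```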

Paper Structure

This paper contains 25 sections, 13 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: (a) Existing methods for language-grounded driving. (b) Our proposed VLDrive: a novel framework featuring a lightweight MLLM architecture with enhanced vision components. (c) Driving failure analysis of existing methods based on three evaluation runs. (d) Performance comparison between VLDrive and both versions of LMDrive, highlighting our method's superior driving performance with fewer parameters.
  • Figure 2: An overview of our proposed VLDrive framework. Given a sequence of visual data, our connector transforms each frame's raw visual features $\mathbf{F}_i$ into sparse yet informative representations $\mathbf{F}_i^v$ through two key components: CCDP: Token Sparsification and Memory-enhanced Feature Aggregation (MEFA). Subsequently, a lite language model augmented with Distance-decoupled Instruction Attention (DDIA) jointly processes tokenized navigation instructions $\mathbf{F}_t$ and temporal visual features $\mathbf{F}_v = \{\mathbf{F}_i^v \mid i = 1,\ldots,T\}$. The resulting hidden representations are fed into an MLP for trajectory prediction, followed by PID controllers that translate these predictions into concrete driving actions. Additionally, we incorporate CCDP: Token Reconstruction as a training-only auxiliary task to further strengthen visual information integrity of $\mathbf{F}_i^v$ via explicit token reconstruction.
  • Figure 3: A detailed illustration of our proposed connector. CCDP: Token Sparsification and Memory-enhanced Feature Aggregation are proposed to reduce token count while enhancing information density. CCDP: Token Reconstruction serves as a training-only task that further ensures the information integrity of the retained tokens. Red arrows ($\rightarrow$) indicate the input and output paths of our connector.
  • Figure 4: A detailed illustration of our proposed Distance-decoupled Instruction Attention (DDIA).
  • Figure 5: Correlation analysis between reconstruction and trajectory prediction losses, revealing a significant positive relationship (Pearson's r = 0.65).
  • ...and 1 more figure
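
The Figure 2 caption says that predicted trajectories are passed to PID controllers, which turn them into driving actions. The sketch below illustrates that last stage only, under stated assumptions: the gains, the ego-frame convention (x forward, y left), the 0.5 s waypoint spacing, the throttle cap, and every name here (`PID`, `waypoints_to_control`) are hypothetical and are not taken from the paper or its repository.

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PID:
    """Textbook PID controller; the integral term is a mean over a short window."""
    kp: float
    ki: float
    kd: float
    window: int = 20  # assumed history length
    errors: deque = field(default_factory=deque)

    def step(self, error: float) -> float:
        self.errors.append(error)
        if len(self.errors) > self.window:
            self.errors.popleft()
        integral = sum(self.errors) / len(self.errors)
        derivative = self.errors[-1] - self.errors[-2] if len(self.errors) > 1 else 0.0
        return self.kp * error + self.ki * integral + self.kd * derivative


def waypoints_to_control(waypoints, speed, turn_pid, speed_pid, dt=0.5):
    """Map the first two predicted waypoints (ego frame, metres) to controls.

    Returns (steer in [-1, 1], throttle in [0, 0.75], brake flag).
    """
    (x0, y0), (x1, y1) = waypoints[0], waypoints[1]
    # Desired speed: distance between consecutive waypoints over the assumed time step.
    target_speed = math.hypot(x1 - x0, y1 - y0) / dt
    # Aim at the midpoint of the two waypoints; 0 rad means straight ahead.
    heading_error = math.atan2((y0 + y1) / 2, (x0 + x1) / 2)
    steer = max(-1.0, min(1.0, turn_pid.step(heading_error)))
    accel = speed_pid.step(target_speed - speed)
    throttle = max(0.0, min(0.75, accel))
    brake = accel < 0.0  # brake only when the controller asks to slow down
    return steer, throttle, brake


# Usage: a straight-ahead trajectory from standstill.
straight = waypoints_to_control(
    [(1.0, 0.0), (2.0, 0.0)], speed=0.0,
    turn_pid=PID(1.0, 0.0, 0.0), speed_pid=PID(0.5, 0.1, 0.0),
)
print(straight)
```

Keeping the controller separate from the learned model means the network only has to predict waypoints; steering and throttle limits can then be tuned without retraining.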