EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation

Jiajun Cao; Xiaoan Zhang; Xiaobao Wei; Liyuqiu Huang; Wang Zijian; Hanzhen Zhang; Zhengyu Jia; Wei Mao; Hao Wang; Xianming Liu; Shuchang Zhou Liu; Yang Wang; Shanghang Zhang

TL;DR

EvoDriveVLA is a collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. It achieves SOTA performance in open-loop evaluation and significantly improves closed-loop performance.

Abstract

Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, from which the optimal trajectory is selected to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
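
The candidate-generation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear planning head, the dropout rate, the number of samples, and the mean-L2 selection rule are all assumptions made for illustration, and the function names `mc_dropout_candidates` and `select_best` are hypothetical. The idea shown is only that dropout stays active at inference so repeated forward passes yield diverse trajectories, and the candidate closest to the ground-truth future is kept as the distillation target.

```python
import numpy as np

def mc_dropout_candidates(features, weights, num_samples, drop_prob, rng):
    """Run a toy linear planning head several times with dropout left on.

    Hypothetical stand-in for an oracle teacher; a real model is far larger.
    """
    horizon = weights.shape[1] // 2
    candidates = []
    for _ in range(num_samples):
        # Inverted dropout: zero random features, rescale the survivors.
        keep = rng.random(features.shape) >= drop_prob
        dropped = features * keep / (1.0 - drop_prob)
        candidates.append((dropped @ weights).reshape(horizon, 2))
    return np.stack(candidates)  # shape: (num_samples, horizon, 2)

def select_best(candidates, gt_future):
    """Pick the candidate with the lowest mean L2 distance to ground truth.

    Assumed selection rule; the paper's actual criterion may differ.
    """
    errors = np.linalg.norm(candidates - gt_future, axis=-1).mean(axis=-1)
    best = int(np.argmin(errors))
    return candidates[best], errors

rng = np.random.default_rng(0)
features = rng.normal(size=16)                  # toy scene features
weights = rng.normal(size=(16, 12))             # 6 waypoints x (x, y)
gt_future = (features @ weights).reshape(6, 2)  # toy ground-truth trajectory

candidates = mc_dropout_candidates(features, weights, 8, 0.1, rng)
best, errors = select_best(candidates, gt_future)
```

In this sketch `best` would serve as the target the student's predicted trajectory is pulled toward, while `errors` gives the per-candidate loss that a before/after sampling comparison could be built on.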

Paper Structure (28 sections, 11 equations, 10 figures, 4 tables)

Introduction
Related Work
End-to-End Autonomous Driving
Vision-Language-Action Models in Driving
Distilling Knowledge for Autonomous Driving
Methodology
Preliminary
Self-Anchored Visual Distillation
Trajectory-Guided Anchoring Constraints.
AnchorFormer Architecture.
Visual Distillation Loss.
Oracle-Guided Trajectory Distillation
The Future-Aware Oracle Teacher.
Coarse-to-Fine Trajectory Refinement.
MC-Dropout Trajectory Sampling.
...and 13 more sections

Figures (10)

Figure 1: Comparison of existing knowledge distillation paradigms for autonomous driving. (a) Single-Trajectory Distillation; (b) Multi-Trajectory Distillation; (c) Collaborative Perception-Planning Distillation (Ours).
Figure 2: Overview of the EvoDriveVLA framework. (Left) Self-anchored visual distillation imposes token-level visual anchoring constraints across the scene; (Right) Oracle-guided trajectory distillation leverages future ground-truth information for trajectory refinement and diversity sampling; (Middle) Collaborative perception-planning distillation enhances autonomous driving VLA model capabilities in both perception and planning to achieve superior driving performance.
Figure 3: Kernel density estimation of trajectory loss distributions for pre-refine and post-refine trajectories. The overlaid boxplots summarize the median, interquartile range, and extreme values.
Figure 4: Comparison of trajectory loss distributions before and after MC-Dropout trajectory sampling.
Figure 5: Qualitative comparison on nuScenes. Our method achieves more accurate long-horizon predictions than VAD and OmniDrive.
...and 5 more figures
