EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation

Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Wang Zijian, Hanzhen Zhang, Zhengyu Jia, Wei Mao, Hao Wang, Xianming Liu, Shuchang Zhou Liu, Yang Wang, Shanghang Zhang

TL;DR

EvoDriveVLA is a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. It achieves SOTA performance in open-loop evaluation and significantly improves performance in closed-loop evaluation.

Abstract

Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after the visual encoder is unfrozen and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, from which the optimal trajectory is selected to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly improves performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
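
The candidate-sampling and selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear trajectory head, the function names, the dropout rate, the mean L2 selection criterion, and the L1 distillation loss are all assumptions made for illustration, and the coarse-to-fine refinement stage is omitted. It shows only the general idea of keeping dropout active at inference to obtain several stochastic trajectory candidates, then picking the one closest to the ground-truth future as the target for the student.

```python
import numpy as np

def predict_with_dropout(features, weights, drop_prob, rng):
    """One stochastic forward pass of a toy teacher head (hypothetical)."""
    # Dropout stays active at inference time, which is the core of MC dropout.
    keep_mask = rng.random(features.shape) >= drop_prob
    # Inverted dropout: rescale kept units so the expected activation is unchanged.
    dropped = features * keep_mask / (1.0 - drop_prob)
    # Toy linear head mapping features to T waypoints of (x, y).
    return (dropped @ weights).reshape(-1, 2)

def sample_candidates(features, weights, num_samples=8, drop_prob=0.1, seed=0):
    """Draw several trajectory candidates, one per stochastic forward pass."""
    rng = np.random.default_rng(seed)
    return np.stack([
        predict_with_dropout(features, weights, drop_prob, rng)
        for _ in range(num_samples)
    ])

def select_best(candidates, gt_trajectory):
    """Pick the candidate with the lowest mean L2 waypoint error (assumed criterion)."""
    errors = np.linalg.norm(candidates - gt_trajectory, axis=-1).mean(axis=-1)
    best_index = int(np.argmin(errors))
    return candidates[best_index], errors

def distillation_loss(student_trajectory, target_trajectory):
    """L1 distance between the student's prediction and the selected target (assumed loss)."""
    return float(np.abs(student_trajectory - target_trajectory).mean())

# Synthetic inputs for demonstration only; no real driving data is involved.
rng = np.random.default_rng(42)
features = rng.normal(size=16)
weights = rng.normal(size=(16, 12)) * 0.1   # 6 waypoints x 2 coordinates
gt_trajectory = rng.normal(size=(6, 2))     # stands in for the ground-truth future

candidates = sample_candidates(features, weights)
best, errors = select_best(candidates, gt_trajectory)
student_trajectory = np.zeros((6, 2))       # placeholder student prediction
loss = distillation_loss(student_trajectory, best)
```

Selecting against the ground-truth future is only possible for a teacher that sees future information during training, which is why the abstract calls it an oracle; the student never receives that signal directly and only learns from the selected trajectory.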

Paper Structure (28 sections, 11 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Comparison of existing knowledge distillation paradigms for autonomous driving. (a) Single-Trajectory Distillation; (b) Multi-Trajectory Distillation; (c) Collaborative Perception-Planning Distillation (Ours).
  • Figure 2: Overview of the EvoDriveVLA framework. (Left) Self-anchored visual distillation imposes token-level visual anchoring constraints across the scene; (Right) Oracle-guided trajectory distillation leverages future ground-truth information for trajectory refinement and diversity sampling; (Middle) Collaborative perception-planning distillation enhances autonomous driving VLA model capabilities in both perception and planning to achieve superior driving performance.
  • Figure 3: Kernel density estimation of trajectory loss distributions for pre-refine and post-refine trajectories. The overlaid boxplots summarize the median, interquartile range, and extreme values.
  • Figure 4: Comparison of trajectory loss distributions before and after MC-Dropout trajectory sampling.
  • Figure 5: Qualitative comparison on nuScenes. Our method achieves more accurate long-horizon predictions than VAD and OmniDrive.
  • ...and 5 more figures