DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue

TL;DR

DeeAD tackles the latency of Vision-Language-Action autonomous driving by introducing a training-free, action-guided early-exit mechanism. It uses an Early Exit Action Head to generate intermediate trajectories, a Dissimilarity Estimator to compare them against a lightweight navigation prior, and a Multi-Hop Exit Controller to adaptively skip layers; exit is triggered when $\mathrm{Dis}^{(l)}<\delta$, with default $\delta=1.0$ m. Implemented on ORION and evaluated on Bench2Drive, DeeAD achieves up to $28\%$ transformer-layer sparsity and around $29\%$ latency reduction while preserving planning quality and safety; strict tolerances yield safer but less sparse exits, whereas looser tolerances boost sparsity at modest cost to accuracy. Overall, DeeAD enables real-time deployment of VLA planning by grounding early exits in physical feasibility rather than confidence, with minimal runtime overhead.
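
The exit rule can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not define the dissimilarity metric, so the mean pointwise L2 distance between waypoints is assumed here, and the names `dissimilarity` and `should_exit` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Waypoint = Tuple[float, float]  # (x, y) position in metres


def dissimilarity(traj: Sequence[Waypoint], prior: Sequence[Waypoint]) -> float:
    """Mean L2 distance between matching waypoints (assumed metric)."""
    if not traj or len(traj) != len(prior):
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, q) for p, q in zip(traj, prior)) / len(traj)


def should_exit(traj: Sequence[Waypoint], prior: Sequence[Waypoint],
                delta: float = 1.0) -> bool:
    """Exit when the layer's trajectory is within delta metres of the prior."""
    return dissimilarity(traj, prior) < delta


# Hypothetical data: a straight-ahead navigation prior and two layer outputs.
prior = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]
shallow_layer = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # drifts sideways
deeper_layer = [(0.0, 0.0), (0.3, 5.0), (0.6, 10.0)]   # close to the prior
```

With these made-up waypoints, `should_exit(shallow_layer, prior)` is false and `should_exit(deeper_layer, prior)` is true, so inference would stop at the deeper layer under the default $\delta=1.0$ m.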

Abstract

Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.
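
The multi-hop idea can be sketched in Python as follows. The abstract says only that hops depend on the change rate of scores, so the concrete rule here is an assumption: take a larger hop while the score is nearly flat, and check every layer while it is changing quickly. The function `find_exit_layer` and the parameters `max_hop` and `slow_rate` are hypothetical, and `score_at` stands in for running the exit head and the dissimilarity estimator at a given layer.

```python
from typing import Callable, List, Tuple


def find_exit_layer(score_at: Callable[[int], float], num_layers: int,
                    delta: float = 1.0, max_hop: int = 4,
                    slow_rate: float = 0.1) -> Tuple[int, List[int]]:
    """Return (exit_layer, checked_layers) under an assumed hop rule."""
    layer = 1
    prev_layer, prev_score = None, None
    checked: List[int] = []
    while layer <= num_layers:
        score = score_at(layer)
        checked.append(layer)
        if score < delta:
            return layer, checked          # trajectory is close enough: exit
        hop = 1
        if prev_layer is not None:
            # Per-layer change in the score since the previous check.
            rate = abs(prev_score - score) / (layer - prev_layer)
            if rate < slow_rate:
                hop = max_hop              # plateau: skip several checks
        prev_layer, prev_score = layer, score
        layer += hop
    return num_layers, checked             # no exit: use the full depth


def toy_score(layer: int) -> float:
    """Made-up score curve: flat at 6 m up to layer 10, then 0.5 m per layer."""
    return 6.0 if layer <= 10 else 6.0 - 0.5 * (layer - 10)


exit_layer, checked = find_exit_layer(toy_score, num_layers=32)
```

On this toy curve the controller exits at layer 21 after evaluating 12 of the first 21 layers, because the flat region is crossed in hops of four. Real score curves and hop rules may differ from this illustration.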

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of autonomous driving paradigms: (a) classic E2E, (b) VLM-based, (c) VLA, and (d) our DeeAD with Action-Guided Early Exit for efficient, physically consistent inference.
  • Figure 2: Illustration of action-space early exiting. Intermediate layers (e.g., L1, L5, L8) produce inaccurate or unsafe trajectories. In contrast, deeper layers (e.g., L14, L24, L32) yield trajectories that align closely with the navigation intent. Since L14 already falls within a reasonable driving corridor, inference can safely terminate early at this point, significantly reducing computational cost without compromising driving quality.
  • Figure 3: Layer-wise $\mathrm{L2}$ distance at 2s between intermediate and final trajectories on Bench2Drive using ORION. Each curve represents one representative case from 1,000 trials. The gray dashed line indicates the 2m tolerance threshold.
  • Figure 4: Overview of the proposed block-layer selective loading framework for multi-modal tasks. Inputs from the vision and text encoders are processed through a shared Transformer backbone, where task-specific layers are dynamically loaded from storage to GPU memory based on the selected cutting range.
  • Figure 5: Early-exit layer distributions under different spatial tolerance thresholds $\delta$. As $\delta$ decreases, the model becomes more conservative, shifting exits toward deeper layers.