EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

Yantai Yang, Yuhao Wang, Zichen Wen, Luo Zhongwei, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang

TL;DR

EfficientVLA addresses the heavy compute and memory demands of diffusion-based Vision-Language-Action (VLA) models with a training-free, structured acceleration framework. It prunes functionally redundant language layers, selects a compact, task-relevant, and diverse set of visual tokens, and caches intermediate computations in the diffusion action head to reduce temporal redundancy. Applied to CogACT in the SIMPLER environment, it delivers about a 1.93x speedup and reduces FLOPs to 28.9% with only a 0.6% drop in success rate, outperforming token-only baselines. Because it requires no retraining, the approach makes VLA systems easier to deploy on resource-constrained robotics platforms and offers a blueprint for future VLA acceleration.

Abstract

Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to CogACT, a standard VLA model, yielding a 1.93x inference speedup and reducing FLOPs to 28.9%, with only a 0.6% success rate drop on the SIMPLER benchmark.
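
To make strategy (1) concrete, the following is a minimal sketch of one way to score layers for redundancy. It assumes, based only on the cosine-similarity analysis mentioned in Figure 1(b), that a layer whose output hidden states point in nearly the same direction as its input contributes little and is a pruning candidate. The exact criterion used by EfficientVLA is not given in this summary, and the names `layer_importance` and `select_layers_to_prune` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_importance(hidden_in, hidden_out):
    """Assumed importance score: 1 - mean per-token cosine similarity
    between the hidden states entering and leaving a layer."""
    sims = [cosine_similarity(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)


def select_layers_to_prune(hidden_states, num_prune):
    """hidden_states[i] holds the token vectors entering layer i, and
    hidden_states[-1] holds the output of the last layer.
    Returns (sorted indices of the num_prune least important layers, all scores)."""
    scores = [
        layer_importance(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_prune]), scores


# Toy hidden states for a 3-layer model with two tokens of dimension 2.
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to layer 0
    [[1.0, 0.1], [0.1, 1.0]],  # after layer 0: direction barely changes
    [[0.0, 1.0], [1.0, 0.0]],  # after layer 1: direction changes strongly
    [[0.0, 2.0], [2.0, 0.0]],  # after layer 2: only rescaled
]
pruned, scores = select_layers_to_prune(toy_states, num_prune=2)
print(pruned)
```

In this toy setup, layers 0 and 2 leave the token directions almost unchanged, so they are the ones selected for removal. In practice the scores would be averaged over a calibration set of real inputs rather than computed on a single example.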

Paper Structure

This paper contains 27 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: VLA inference bottleneck and redundancy analysis: (a) Visual token pruning impact on FLOPs and inference time, revealing computation-bound and memory-bound regimes. (b) High inter-layer cosine similarity of LLM hidden states, indicating depth-wise redundancy. (c) Temporal cosine similarity of MLP/attention features in diffusion steps, showing computational redundancy.
  • Figure 2: Overview of the EfficientVLA framework, our training-free, structured approach to accelerate diffusion-based VLAs. It employs: (1) pruning of redundant language module layers; (2) VLA task-aware visual token selection balancing task relevance and informational diversity; and (3) temporal caching of intermediate features in the diffusion action head.
  • Figure 3: Efficiency analysis in simulation, comparing FLOPs and inference time of our EfficientVLA variants against the original model backbone. EfficientVLA-22 and EfficientVLA-28 denote configurations retaining 22 and 28 LLM layers, respectively.
  • Figure 4: Representative robotic manipulation tasks for the Google robot in the SIMPLER environment: (a) Pick coke can, (b) Move near, (c) Open/close drawer, and (d) Open top drawer and place apple.
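
The temporal caching idea from Figure 1(c) and Figure 2 can be illustrated with a small sketch. It assumes a fixed refresh interval: an expensive feature is recomputed every `interval` denoising steps and reused in between. This schedule, the `CachedBlock` class, and the toy update rule in `run_denoising` are all assumptions made for illustration, not the caching policy or action head of EfficientVLA.

```python
class CachedBlock:
    """Wraps an expensive function and reuses its last output between refreshes."""

    def __init__(self, fn, interval):
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.fn = fn
        self.interval = interval
        self.cached = None
        self.num_calls = 0  # how many times fn was actually evaluated

    def __call__(self, x, step):
        # Recompute on refresh steps (step 0 always refreshes); otherwise reuse.
        if self.cached is None or step % self.interval == 0:
            self.cached = self.fn(x)
            self.num_calls += 1
        return self.cached


def run_denoising(block, x, num_steps):
    """Toy iterative loop: each step subtracts a fraction of the block's feature."""
    for step in range(num_steps):
        feature = block(x, step)
        x = x - 0.1 * feature
    return x


def expensive_feature(x):
    """Stand-in for an attention or MLP feature inside the action head."""
    return x


exact_block = CachedBlock(expensive_feature, interval=1)   # no reuse
cached_block = CachedBlock(expensive_feature, interval=2)  # reuse every other step

x_exact = run_denoising(exact_block, 1.0, num_steps=10)
x_cached = run_denoising(cached_block, 1.0, num_steps=10)
print(exact_block.num_calls, cached_block.num_calls)
```

With an interval of 2, the wrapped function is evaluated half as often, at the cost of a small deviation from the uncached result. That trade-off is only worthwhile when features change slowly across steps, which is the high temporal similarity Figure 1(c) reports.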