SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, Guohao Dai

TL;DR

Vision-Language-Action models are compute-bound, and existing token pruning often hurts success rate by using only local cues. SpecPrune-VLA introduces action-aware self-speculative pruning with static action-level pruning using global past information, dynamic layer-level pruning, and a lightweight controller to adapt pruning by action granularity. On LIBERO benchmarks, it achieves ~1.46× speedup on NVIDIA A800 and ~1.57× on RTX 3090 with negligible loss in success rate, demonstrating cross-platform effectiveness. The approach leverages temporal redundancy and multi-level token importance to deliver practical acceleration without retraining or external draft models.

Abstract

Pruning accelerates compute-bound models by reducing computation. It has recently been applied to Vision-Language-Action (VLA) models, but existing methods select tokens using only local information from the current action and ignore global context from prior actions, which causes a success rate drop of more than 20% and limits the speedup. We observe high similarity across consecutive actions and propose combining local (current) and global (past) information for better token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) Static pruning at the action level, which uses global history and local context to reduce the visual tokens per action; (2) Dynamic pruning at the layer level, which prunes tokens per layer based on layer-specific importance; (3) A lightweight action-aware controller, which classifies actions as coarse-grained or fine-grained by speed and adjusts pruning aggressiveness accordingly, since fine-grained actions are sensitive to pruning. Experiments on LIBERO show that SpecPrune-VLA achieves a 1.46× speedup on the NVIDIA A800 and a 1.57× speedup on the NVIDIA GeForce RTX 3090 compared with OpenVLA-OFT, with negligible loss in success rate.
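The combination of global and local information with a speed-based controller can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the union-of-top-k selection rule, the speed threshold, and the assumption that slow motion indicates a fine-grained action are all hypothetical choices made here for illustration.

```python
from typing import List, Set


def top_k_indices(scores: List[float], k: int) -> Set[int]:
    """Return the indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def select_tokens(global_scores: List[float], local_scores: List[float],
                  k_global: int, k_local: int) -> List[int]:
    """Keep a visual token if it ranks highly under either score.

    global_scores: per-token importance from the previous action (assumed input).
    local_scores:  per-token importance from the current action (assumed input).
    """
    keep = top_k_indices(global_scores, k_global) | top_k_indices(local_scores, k_local)
    return sorted(keep)


def keep_budget(speed: float, speed_threshold: float,
                k_coarse: int, k_fine: int) -> int:
    """Hypothetical controller: treat slow motion as fine-grained and keep more tokens."""
    return k_fine if speed < speed_threshold else k_coarse


# Toy importance scores for six visual tokens (made-up values).
global_scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3]
local_scores = [0.2, 0.8, 0.1, 0.6, 0.05, 0.4]

k = keep_budget(speed=0.02, speed_threshold=0.05, k_coarse=1, k_fine=2)
kept = select_tokens(global_scores, local_scores, k, k)
print(kept)  # [0, 1, 3]
```

In this toy run the slow action is treated as fine-grained, so the budget is 2 per source and tokens 0, 1, and 3 survive; with the coarse budget of 1, only tokens 0 and 1 would be kept.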

Paper Structure

This paper contains 48 sections, 17 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: (a) The mainstream inference dataflow of VLA models. (b) Latency breakdown in three typical VLA models in the LIBERO benchmark during each action generation. (c) The practical arithmetic intensity of three models in the roofline model of NVIDIA A800 GPU.
  • Figure 2: Overview of SpecPrune-VLA. We prune the image tokens at two levels with a lightweight action-aware controller.
  • Figure 3: (a) The original image the model sees; (b) The image after unimportant tokens are pruned; (c) Tokens are pruned randomly, so some important tokens are removed; (d) Important tokens (e.g., the tomato sauce) are pruned; (e) The influence of different pruning strategies and numbers of pruned tokens on performance.
  • Figure 4: For the task "turn on the oven and put the bottle on it": (a) Tokens pruned using local information (e.g., attention scores from one LLM layer). (b) Tokens pruned using the global attention of the LLM from the previous action generation. (c) The tokens that prove important after inference completes.
  • Figure 5: (a) Comparison of the hit rate when leveraging the first one, two, and three layers. (b) LLM latency comparison when leveraging the first one, two, and three layers.
  • ...and 6 more figures