SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

Hanzhen Wang; Jiaming Xu; Jiayi Pan; Yongkang Zhou; Guohao Dai

TL;DR

Vision-Language-Action (VLA) models are compute-bound, and existing token pruning methods often hurt success rate because they rely only on local cues from the current action. SpecPrune-VLA introduces action-aware self-speculative pruning, which combines static action-level pruning that uses global information from past actions, dynamic layer-level pruning, and a lightweight controller that adapts pruning to action granularity. On the LIBERO benchmarks, it achieves ~1.46× speedup on NVIDIA A800 and ~1.57× on NVIDIA GeForce RTX 3090 over OpenVLA-OFT with negligible loss in success rate, demonstrating cross-platform effectiveness. The approach exploits temporal redundancy and multi-level token importance to deliver practical acceleration without retraining or external draft models.

Abstract

Pruning accelerates compute-bound models by reducing the amount of computation. Recent work has applied it to Vision-Language-Action (VLA) models, but existing methods prune tokens using only local information from the current action and ignore the global context from prior actions, which causes a success rate drop of more than 20% and limits the speedup. We observe high similarity across consecutive actions and propose leveraging both local (current) and global (past) information for smarter token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) static pruning at the action level, which uses global history and local context to reduce the visual tokens per action; (2) dynamic pruning at the layer level, which prunes tokens per layer based on layer-specific importance; and (3) a lightweight action-aware controller, which classifies actions as coarse-grained or fine-grained by speed and adjusts pruning aggressiveness accordingly, since fine-grained actions are sensitive to pruning. Experiments on LIBERO show that SpecPrune-VLA achieves a 1.46× speedup on NVIDIA A800 and a 1.57× speedup on NVIDIA GeForce RTX 3090 compared to OpenVLA-OFT, with negligible loss in success rate.
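
To make the interplay between the static pruning step and the action-aware controller concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each visual token already has a global importance score (from past actions) and a local one (from the current action), that the two are combined by taking the union of each score's top-k tokens, and that the controller is a simple threshold on translation speed. The function names, the threshold, and the keep ratios are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def classify_action(translation_delta, speed_threshold=0.05):
    """Label an action 'coarse' or 'fine' from its translation speed.

    The speed threshold is an illustrative assumption, not a value from the paper.
    """
    speed = sqrt(sum(d * d for d in translation_delta))
    return "coarse" if speed >= speed_threshold else "fine"


def top_k_indices(scores, k):
    """Return the set of indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def select_tokens(global_scores, local_scores, keep_ratio):
    """Keep tokens ranked highly by either global or local importance.

    Taking the union of the two top-k sets is an assumed combination rule.
    """
    k = max(1, int(len(global_scores) * keep_ratio))
    kept = top_k_indices(global_scores, k) | top_k_indices(local_scores, k)
    return sorted(kept)


def static_prune(global_scores, local_scores, translation_delta,
                 keep_ratio_coarse=0.25, keep_ratio_fine=0.5):
    """Prune more aggressively for coarse-grained actions than for fine-grained ones."""
    mode = classify_action(translation_delta)
    ratio = keep_ratio_coarse if mode == "coarse" else keep_ratio_fine
    return mode, select_tokens(global_scores, local_scores, ratio)


# Hypothetical importance scores for 8 visual tokens.
global_scores = [0.9, 0.1, 0.2, 0.8, 0.0, 0.3, 0.1, 0.05]
local_scores = [0.0, 0.7, 0.1, 0.9, 0.2, 0.1, 0.6, 0.0]

fast = static_prune(global_scores, local_scores, (0.1, 0.0, 0.0))
slow = static_prune(global_scores, local_scores, (0.01, 0.0, 0.0))
print(fast)  # ('coarse', [0, 1, 3])
print(slow)  # ('fine', [0, 1, 2, 3, 4, 5, 6])
```

In this toy setting the fast (coarse-grained) action keeps three of eight tokens, while the slow (fine-grained) action keeps seven, mirroring the idea that fine-grained actions tolerate less pruning. The layer-level dynamic pruning stage is not shown here.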
