VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

TL;DR

VLA-Pruner tackles the real-time inference bottleneck of Vision-Language-Action (VLA) models by introducing a training-free, dual-level visual token pruning method that accounts for both semantic understanding and action execution. It uses vision-language prefill attention for semantic relevance and temporally smoothed action decode attention for action relevance, combining them through a patch-wise Max-Relevance and Min-Redundancy (mRMR) strategy to select a compact set of $\tilde{M}$ tokens from the original $M$. The approach is plug-and-play across the OpenVLA, OpenVLA-OFT, and $\pi_0$ architectures, delivering speedups of up to $1.8\times$ with minimal performance loss and even improved performance at moderate pruning, validated on LIBERO, SIMPLER, and a real robot. This work enables practical, real-time embodied AI by balancing semantic grounding with precise motor control under tight compute budgets, while generalizing across architectures and tasks.
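To make the max-relevance, min-redundancy idea concrete, here is a minimal greedy sketch in plain Python. It is not the paper's exact formulation: the function names, the element-wise maximum used to merge the two relevance scores, the cosine-similarity redundancy term, and the `trade_off` weight are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def dual_relevance(semantic_scores, action_scores):
    """Assumed merge rule: a patch is relevant if either level finds it relevant."""
    return [max(s, a) for s, a in zip(semantic_scores, action_scores)]


def mrmr_select(relevance, features, budget, trade_off=1.0):
    """Greedily pick `budget` patch indices with high relevance and low redundancy.

    Redundancy of a candidate is its largest cosine similarity to any patch
    already selected (an illustrative choice, not taken from the paper).
    """
    remaining = set(range(len(relevance)))
    selected = []
    while remaining and len(selected) < budget:
        def score(i):
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected), default=0.0
            )
            return relevance[i] - trade_off * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: patches 0 and 1 are near-duplicates, so only one of them is kept.
semantic = [0.9, 0.85, 0.1, 0.6]
action = [0.2, 0.3, 0.3, 0.1]
feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
kept = mrmr_select(dual_relevance(semantic, action), feats, budget=2)
print(kept)
```

In this toy case the second pick skips patch 1, despite its high relevance, because it nearly duplicates patch 0. Avoiding that kind of redundant selection is what the min-redundancy term is for.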

Abstract

Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
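The abstract says the action-level importance is estimated via temporal smoothing. One plausible way to do this is an exponentially weighted average over the action-decode attention maps of recent timesteps. The sketch below assumes that form; `smooth_action_attention`, `decay`, and `window` are hypothetical names, and the paper's actual estimator may differ.

```python
def smooth_action_attention(history, decay=0.5, window=3):
    """Estimate the current per-patch action attention from past timesteps.

    `history` is a list of per-patch attention maps, oldest first. Only the
    last `window` maps are used. The most recent map gets weight 1, the one
    before it `decay`, then `decay ** 2`, and so on; weights are normalized
    so they sum to 1.
    """
    recent = history[-window:]
    if not recent:
        raise ValueError("need at least one past attention map")
    weights = [decay ** age for age in range(len(recent) - 1, -1, -1)]
    total = sum(weights)
    num_patches = len(recent[0])
    return [
        sum(w * attn[i] for w, attn in zip(weights, recent)) / total
        for i in range(num_patches)
    ]


# Two past timesteps over two patches; the newer map counts twice as much.
past = [[0.0, 1.0], [1.0, 0.0]]
estimate = smooth_action_attention(past, decay=0.5, window=2)
print(estimate)
```

Because the weights are normalized, the estimate sums to 1 whenever each past map does, so it can be compared directly against a normalized prefill attention map.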

Paper Structure

This paper contains 49 sections, 16 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Comparison of different visual token pruning/caching methods across various pruning/caching ratios for OpenVLA kim2024openvla. The y-axis is the success rate averaged over the LIBERO liu2023libero benchmark. The proposed method significantly outperforms all baselines, especially at high pruning/caching ratios. At a 50% ratio, our method even improves the performance of OpenVLA.
  • Figure 2: Distinct attention patterns across VLA inference (see Sec. \ref{sec:observation}). (a–b) Overlap ratios of $\text{Top-}k$ attended patches: (i) between vision–language prefill and action decode ($\text{Top-}k(\mathcal{S}_{\text{vl}})$ vs. $\text{Top-}k(\mathcal{S}_{\text{act}})$), and (ii) between consecutive action-decode timesteps ($\text{Top-}k(\mathcal{S}^{t}_{\text{act}})$ vs. $\text{Top-}k(\mathcal{S}^{t-1}_{\text{act}})$). We show the average values (a) and a representative rollout (b). (c–d) Visualization of prefill (c) and action decode (d) attention over the same frame; top 12.5% (yellow), 25% (orange), and 50% (purple) patches are overlaid. Prefilling shows broad semantic coverage, while action decoding is locally focused. The results reveal the VLA model’s dual-system nature, which token pruning must consider.
  • Figure 3: Overall pipeline of VLA-Pruner, illustrated with token budget $k=3$. It adopts (a) a dual-level token importance criterion that incorporates semantic-level and action-level importance and (b) a dual-level token selection strategy that combines the max-relevance and min-redundancy principles. See Sec. \ref{sec:vla-pruner} for details.
  • Figure 4: Performance of OpenVLA with pruning/caching methods across LIBERO tasks under varying pruning/caching ratios. The horizontal axis represents the pruning/caching ratio of visual tokens, and the vertical axis shows the success rate. VLA-Pruner performs best, especially as the ratio increases.
  • Figure 5: Performance of VLA-Pruner on OpenVLA-OFT for real-robot tasks under a 75% pruning ratio. We conduct experiments using a 6-DoF xArm6 robotic arm. VLA-Pruner consistently preserves model performance better than the baselines, demonstrating its practical advantage.
  • ...and 7 more figures
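
The Top-$k$ overlap ratio described in the Figure 2 caption can be sketched as follows. This is an assumed reading of the metric (size of the intersection of the two Top-$k$ patch sets, divided by $k$); the helper names are hypothetical and the paper's exact normalization is not given in this summary.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring patches (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the Top-k patches shared by two attention maps."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = top_k_indices(scores_a, k) & top_k_indices(scores_b, k)
    return len(shared) / k


# Toy prefill vs. action-decode attention over four patches.
prefill_attn = [0.9, 0.1, 0.8, 0.2]
action_attn = [0.1, 0.9, 0.7, 0.3]
print(top_k_overlap(prefill_attn, action_attn, k=2))
```

The same function covers both comparisons in the caption: pass prefill and action-decode scores for case (i), or the action-decode scores at timesteps $t$ and $t-1$ for case (ii).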

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3