Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation

Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, Chang Xu

TL;DR

This work tackles the computational bottleneck of Vision-Language-Action models in robotic manipulation by introducing Action-aware Dynamic Pruning (ADP). ADP combines text-driven anticipatory pruning with an action-aware gating mechanism that adapts token retention based on recent end-effector motion, enabling aggressive pruning during coarse phases and preserving detail during fine-grained actions. Theoretical analysis and extensive experiments on LIBERO and real-robot tasks show significant FLOP reductions and faster inference with maintained or improved success rates, demonstrating a practical, plug-in approach to efficient VLA policies. Early-layer token scoring and dynamic gating emerge as key factors for balancing efficiency and precision across manipulation stages.

Abstract

Robotic manipulation with Vision-Language-Action (VLA) models requires efficient inference over long-horizon multi-modal context, where attention to dense visual tokens dominates computational cost. Existing methods speed up inference by reducing visual redundancy within VLA models, but they overlook how that redundancy varies across manipulation stages. We observe that visual token redundancy is higher in coarse manipulation phases than in fine-grained operations, and that it is strongly correlated with the action dynamics. Motivated by this observation, we propose Action-aware Dynamic Pruning (ADP), a multi-modal pruning framework that integrates text-driven token selection with action-aware trajectory gating. Our method introduces a gating mechanism that conditions the pruning signal on recent action trajectories, using past motion windows to adaptively adjust token retention ratios in accordance with the dynamics, thereby balancing computational efficiency and perceptual precision across manipulation stages. Extensive experiments on the LIBERO suites and diverse real-world scenarios demonstrate that our method significantly reduces FLOPs and action inference latency (e.g., a $1.35 \times$ speedup on OpenVLA-OFT) while maintaining competitive success rates (e.g., 25.8% improvements with OpenVLA) compared to baselines, thereby providing a simple plug-in path to efficient robot policies that advances the efficiency and performance frontier of robotic manipulation. Our project website is https://vla-adp.github.io/.
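The gating idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's gating rule, which the abstract does not specify. It assumes, purely for illustration, that a window of recent translational end-effector deltas is summarised by its mean Euclidean displacement, and that pruning is enabled when this value exceeds a threshold (large motion is treated as a coarse phase). The names `mean_displacement`, `should_prune`, and the value `threshold=0.02` are hypothetical.

```python
import math
from typing import Sequence, Tuple

# Hypothetical layout: one (dx, dy, dz) translational delta per past step.
Delta = Tuple[float, float, float]


def mean_displacement(window: Sequence[Delta]) -> float:
    """Average Euclidean length of the translational deltas in the window."""
    if not window:
        return 0.0
    total = sum(math.sqrt(dx * dx + dy * dy + dz * dz) for dx, dy, dz in window)
    return total / len(window)


def should_prune(window: Sequence[Delta], threshold: float = 0.02) -> bool:
    """Assumed gate: prune only when recent motion is large (coarse phase).

    Small recent motion is treated as a fine-grained phase, so the full
    set of vision tokens is kept. An empty window also keeps full vision.
    """
    return mean_displacement(window) > threshold


if __name__ == "__main__":
    coarse = [(0.05, 0.0, 0.0)] * 4    # large steps, e.g. reaching
    fine = [(0.001, 0.002, 0.0)] * 4   # small steps, e.g. grasping
    print(should_prune(coarse), should_prune(fine))
```

Returning `False` for an empty window is a conservative choice in this sketch: with no motion history, the policy sees the unpruned observation.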

Paper Structure

This paper contains 19 sections, 26 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Action-aware dynamic pruning vs. static pruning. We visualize five past observation windows from a manipulation episode, each of which conditions the current anticipatory window. $p_1$, $p_3$ and $p_5$ reflect coarse phases, prompting the gate to enable pruning and suppress redundant tokens, whereas $p_2$ and $p_4$ are delicate phases that require detailed visual context, so pruning is disabled and full vision is used. The curves depict the robot's motion, which drives the gating rule.
  • Figure 2: Overview of our proposed Action-aware Dynamic Pruning (ADP) for Vision-Language-Action models. (a) Action-aware gating: the pruning function adaptively determines whether to prune based on recent end-effector trajectories ($\Delta x, \Delta y, \Delta z, \Delta \phi, \Delta \theta, \Delta \psi, g$), enabling dynamic pruning. (b) Anticipatory pruning: task-relevant visual tokens are selected via attention-based relevance, while redundant patches are discarded before entering the VLA backbone.
  • Figure 3: Text-driven Anticipatory Pruning. Step 1: Retrieve pretrained weights from Layer $l$ to compute relevance scores. Step 2: Use the text as a guide to prune vision tokens based on the ranking.
  • Figure 4: Visualisation of our method on representative examples of the four LIBERO task types (Spatial, Object, Goal, Long). Blue masks indicate pruned vision tokens. The retained tokens consistently highlight task-relevant objects, validating the Text-driven Anticipatory Pruning. Moreover, full vision tokens are restored at critical phases (e.g., initialisation, grasping, placement), demonstrating the effectiveness of the Task-driven Pruning and Action-Aware Dynamic Strategy.
  • Figure 5: Real-world experiments. We conduct the experiments on the Jaco2 real-world platform.
  • ...and 3 more figures
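
The two steps in the Figure 3 caption (score vision tokens by text relevance, then prune by rank) can be sketched as follows. The captions do not give the scoring formula, so this sketch assumes one: scaled dot-product attention from text queries to vision keys, averaged over text tokens. It also assumes the queries and keys have already been projected with the layer's pretrained weights. The function names and the `keep_ratio` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_scores(text_q: Sequence[Vector], vision_k: Sequence[Vector]) -> List[float]:
    """Mean text-to-vision attention weight received by each vision token."""
    d = len(vision_k[0])
    scores = [0.0] * len(vision_k)
    for q in text_q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vision_k]
        for i, p in enumerate(softmax(logits)):
            scores[i] += p / len(text_q)
    return scores


def prune_vision_tokens(text_q: Sequence[Vector], vision_k: Sequence[Vector],
                        keep_ratio: float) -> List[int]:
    """Return indices of the retained vision tokens, in their original order."""
    scores = relevance_scores(text_q, vision_k)
    n_keep = max(1, math.ceil(keep_ratio * len(vision_k)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    text_q = [[1.0, 0.0]]
    vision_k = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    print(prune_vision_tokens(text_q, vision_k, keep_ratio=0.5))
```

Returning the kept indices in their original order preserves the spatial ordering of the surviving patches, which matters if the backbone relies on positional information.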

Theorems & Definitions (1)

  • Definition 4.1: Windowed FK for EEF Position