Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation
Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, Chang Xu
TL;DR
This work tackles the computational bottleneck of Vision-Language-Action models in robotic manipulation by introducing Action-aware Dynamic Pruning (ADP). ADP combines text-driven anticipatory pruning with an action-aware gating mechanism that adapts token retention based on recent end-effector motion, enabling aggressive pruning during coarse phases and preserving detail during fine-grained actions. Theoretical analysis and extensive experiments on LIBERO and real-robot tasks show significant FLOP reductions and faster inference with maintained or improved success rates, demonstrating a practical, plug-in approach to efficient VLA policies. Early-layer token scoring and dynamic gating emerge as key factors for balancing efficiency and precision across manipulation stages.
Abstract
Robotic manipulation with Vision-Language-Action (VLA) models requires efficient inference over long-horizon multi-modal context, where attention over dense visual tokens dominates the computational cost. Existing methods speed up inference by reducing visual redundancy within VLA models, but they overlook how that redundancy varies across robotic manipulation stages. We observe that visual token redundancy is higher in coarse manipulation phases than in fine-grained operations, and that it is strongly correlated with action dynamics. Motivated by this observation, we propose \textbf{A}ction-aware \textbf{D}ynamic \textbf{P}runing (\textbf{ADP}), a multi-modal pruning framework that integrates text-driven token selection with action-aware trajectory gating. Our method introduces a gating mechanism that conditions the pruning signal on recent action trajectories, using windows of past motion to adapt the token retention ratio to the current dynamics, thereby balancing computational efficiency and perceptual precision across manipulation stages. Extensive experiments on the LIBERO suites and diverse real-world scenarios demonstrate that our method significantly reduces FLOPs and action inference latency (\textit{e.g.}, a $1.35\times$ speedup on OpenVLA-OFT) while maintaining competitive success rates (\textit{e.g.}, a 25.8\% improvement with OpenVLA) compared to baselines. ADP thus provides a simple plug-in path to efficient robot policies that advances the efficiency and performance frontier of robotic manipulation. Our project website is: \href{https://vla-adp.github.io/}{ADP.com}.
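To make the gating idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hard threshold on the mean end-effector displacement over a recent window, and a dot-product relevance score between visual tokens and a single text query vector. The function names (`retention_ratio`, `prune_tokens`), the threshold, and the two retention levels are all hypothetical choices made for illustration; the actual gating rule and scoring used by ADP are not specified in this section.

```python
import numpy as np


def retention_ratio(positions, threshold=0.02, low=0.3, high=1.0):
    """Pick a token retention ratio from recent end-effector motion.

    positions: array-like of shape (T, 3), recent end-effector positions.
    threshold, low, high: hypothetical values chosen for this sketch.
    Large mean displacement is treated as a coarse phase (prune hard);
    small displacement is treated as a fine-grained phase (keep tokens).
    """
    positions = np.asarray(positions, dtype=float)
    if len(positions) < 2:
        return high  # no motion history yet: keep everything
    # Per-step displacement magnitudes over the window.
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return low if steps.mean() > threshold else high


def prune_tokens(visual_tokens, text_query, ratio):
    """Keep the top `ratio` fraction of visual tokens by text relevance.

    visual_tokens: (N, D) array; text_query: (D,) array.
    The dot-product score is an assumption standing in for whatever
    text-driven scoring the model actually uses.
    """
    scores = visual_tokens @ text_query
    k = max(1, int(round(ratio * len(visual_tokens))))
    # Take the k highest-scoring tokens, then restore original order.
    keep = np.sort(np.argsort(scores)[-k:])
    return visual_tokens[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(10, 4))
    query = rng.normal(size=4)
    fast_window = [[0.05 * i, 0.0, 0.0] for i in range(5)]
    ratio = retention_ratio(fast_window)
    kept, idx = prune_tokens(tokens, query, ratio)
    print(ratio, kept.shape, idx)
```

In this sketch the gate is binary for simplicity; a smooth mapping from motion magnitude to retention ratio would be an equally plausible reading of "adaptively adjust token retention ratios".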
