SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration

Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu

TL;DR

SP-VLA introduces a joint scheduling and token-pruning framework to accelerate Vision-Language-Action models by exploiting temporal and spatial redundancies. It classifies actions as intuitive or deliberative and routes them to a lightweight generator or the VLA backbone, while performing spatio-semantic token pruning to preserve essential visual cues. Through action-type aware scheduling and dual-aware pruning, SP-VLA achieves notable speedups (1.5x lossless on LIBERO, 2.4x on SimplerEnv) with minimal or even positive accuracy changes, and substantial improvements in inference frequency and latency. The approach demonstrates strong potential for real-world deployment of VLA systems in embodied robotics and autonomous tasks, supported by extensive experiments and ablation analyses.

Abstract

Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA model in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Extensive experiments show that our method achieves 1.5$\times$ lossless acceleration on LIBERO and 2.4$\times$ on SimplerEnv, with up to 6% average performance gain. Inference frequency and latency improve by 2.2$\times$ on SimplerEnv and 1.4$\times$ on LIBERO.
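To make the dual-aware pruning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each visual token already has a semantic score and a spatial score (how these are computed is not specified in the abstract), and it keeps the union of the top-scoring tokens of each type while preserving their original order. The function name `prune_tokens` and the two `keep_*` parameters are hypothetical.

```python
import numpy as np

def prune_tokens(semantic_score, spatial_score, keep_semantic, keep_spatial):
    """Return the sorted indices of visual tokens to keep.

    semantic_score, spatial_score: 1-D sequences, one value per token.
    keep_semantic, keep_spatial: number of top tokens of each type to
    retain (hypothetical knobs; the paper's selection rule may differ).
    """
    semantic_score = np.asarray(semantic_score, dtype=float)
    spatial_score = np.asarray(spatial_score, dtype=float)
    if semantic_score.ndim != 1 or semantic_score.shape != spatial_score.shape:
        raise ValueError("scores must be 1-D arrays of equal length")

    n = semantic_score.size
    k_sem = int(np.clip(keep_semantic, 0, n))
    k_spa = int(np.clip(keep_spatial, 0, n))

    # argsort is ascending, so the last k entries are the k highest scores.
    top_sem = np.argsort(semantic_score, kind="stable")[n - k_sem:]
    top_spa = np.argsort(spatial_score, kind="stable")[n - k_spa:]

    # The union is returned sorted, so surviving tokens keep their
    # original relative positions in the sequence.
    return np.union1d(top_sem, top_spa)

# Six tokens: two are semantically salient, two carry spatial structure.
kept = prune_tokens(
    semantic_score=[0.9, 0.1, 0.2, 0.8, 0.1, 0.3],
    spatial_score=[0.0, 0.7, 0.1, 0.1, 0.9, 0.2],
    keep_semantic=2,
    keep_spatial=2,
)
print(kept.tolist())  # [0, 1, 3, 4]
```

Taking a union rather than a single blended ranking is one simple way to ensure that tokens important under either criterion survive; a weighted sum of the two scores would be an equally plausible reading of "dual-aware importance".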

Paper Structure

This paper contains 29 sections, 7 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: The main idea of SP-VLA. Unlike traditional VLA models, SP-VLA first determines the type of the current action. (1) For intuitive actions, a lightweight action generator is employed to approximate the output, while for deliberative actions, the high-precision VLA model is used to ensure accuracy. (2) When the VLA model is invoked, we further accelerate inference by adaptively pruning tokens based on integrated spatial and semantic information. By jointly leveraging the above two strategies, SP-VLA effectively directs the model’s attention to critical actions and salient visual information, achieving substantial speedup without compromising accuracy.
  • Figure 2: The visualization of VLA model behavior. (a) shows the velocity profile of the robot arm across 50 pick-and-place trials, following a consistent four-phase pattern: targeting, grasping, moving, and placing. The VLA model demonstrates complex behavior by adjusting orientation at key points and learning kinematic patterns such as acceleration and deceleration. These action sequences comprise both deliberative and intuitive components. (b) shows task performance under different token distributions. Random pruning degrades accuracy, highlighting the presence of token redundancy. However, relying exclusively on semantic importance, such as through reordering or semantic-aware pruning, causes the model to fail to complete the task. In contrast, integrating spatial and semantic information enables efficient pruning while preserving performance, as the VLA model relies on token relative positions and object contours for spatial understanding.
  • Figure 3: The framework of SP-VLA. SP-VLA accelerates the inference process through joint model scheduling and token pruning. Left: At each time step $t$, the scheduler classifies the current action as intuitive or deliberative based on the historical trajectories in the action buffer. For intuitive actions, Ridge Regression estimates the translational and rotational components, reusing the gripper state at $t-1$. Otherwise, the VLA model generates a fine-grained action. Right: To support spatial understanding, we rank token importance by combining spatial information from the Canny operator with semantic importance, and perform velocity-adaptive pruning for optimal acceleration.
  • Figure 4: Visualizations of SP-VLA across different tasks. As shown in the figure, our method efficiently identifies redundant regions in the image and adaptively prunes tokens to accelerate VLA inference. At the same time, it effectively preserves object contour information, ensuring that the VLA model maintains its spatial perception capability.
  • Figure 5: Visualization examples generated by SP-VLA on LIBERO-Spatial.
  • ...and 11 more figures
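
The scheduling path described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the action layout (six motion dimensions followed by one gripper dimension), the residual-threshold rule for deciding whether an action is intuitive, and the names `fit_ridge`, `next_action`, `residual_tol`, and `vla_policy` are all hypothetical. Only the use of ridge regression for the motion components and the reuse of the gripper state from step $t-1$ come from the caption.

```python
import numpy as np

def fit_ridge(history, lam):
    """Fit y = a + b*t per action dimension with closed-form ridge.

    For simplicity the intercept is penalized as well, unlike some
    library implementations.
    """
    steps = history.shape[0]
    X = np.column_stack([np.ones(steps), np.arange(steps, dtype=float)])
    W = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ history)
    return X, W

def next_action(buffer, vla_policy, residual_tol=1e-3, lam=1e-6):
    """Pick the next action from the lightweight generator or the VLA model.

    buffer: array of shape (T, 7), assumed layout: 6 motion dims + gripper.
    vla_policy: zero-argument callable standing in for the VLA model.
    """
    buffer = np.asarray(buffer, dtype=float)
    if buffer.shape[0] < 3:
        return np.asarray(vla_policy(), dtype=float), "vla"

    motion = buffer[:, :6]
    X, W = fit_ridge(motion, lam)

    # Assumed criterion: if a linear trend explains the recent motion,
    # treat the next action as intuitive; otherwise as deliberative.
    if np.abs(X @ W - motion).max() > residual_tol:
        return np.asarray(vla_policy(), dtype=float), "vla"

    # Extrapolate one step ahead and reuse the gripper state from t-1.
    predicted = np.array([1.0, float(buffer.shape[0])]) @ W
    return np.append(predicted, buffer[-1, 6]), "generator"

# Smooth linear motion: handled by the lightweight generator.
t = np.arange(5, dtype=float)[:, None]
smooth = np.hstack([0.1 * t * np.ones((1, 6)), np.ones((5, 1))])
action, source = next_action(smooth, vla_policy=lambda: np.zeros(7))
print(source, np.round(action, 3))

# Zigzag motion: the linear fit is poor, so the VLA model is invoked.
zigzag = smooth.copy()
zigzag[:, 0] = [0.0, 1.0, 0.0, 1.0, 0.0]
_, zigzag_source = next_action(zigzag, vla_policy=lambda: np.zeros(7))
print(zigzag_source)
```

In practice the threshold and buffer length would need tuning per task, and the paper's actual classification rule may rely on different statistics of the action buffer.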