VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

Jintao Cheng; Haozhe Wang; Weibin Li; Gang Wang; Yipu Zhang; Xiaoyu Tang; Jin Wu; Xieyuanli Chen; Yunhui Liu; Wei Zhang

Abstract

Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods rely mainly on semantic saliency or simple temporal cues and overlook continuous physical interaction, a fundamental property of VLA tasks. Consequently, they often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a $1.25\times$ speedup on the LIBERO benchmark, and up to a $1.54\times$ speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures, three simulation environments, and a real robot platform, validating its generalization capability and practical applicability. Project website: https://chengjt1999.github.io/VLA-IAP.github.io/

Paper Structure (55 sections, 14 equations, 9 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Vision-Language-Action Models
Visual Token Compression for VLA
Methodology
Overview
Geometric Prior for Edge Enhancement
Semantic-Motion Alignment Module
Semantic Prior.
Motion Prior.
Interaction-Aligned Dynamic Strategy.
Conservative Mode (Exploration Phase, $\text{IoU}_t \le \theta_{iou}$):
Aggressive Mode (Interaction Lock Phase, $\text{IoU}_t > \theta_{iou}$):
Final Visual Token Selection
Experiment
...and 40 more sections
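
The two mode headings above gate pruning on an alignment score $\text{IoU}_t$ compared against a threshold $\theta_{iou}$. A minimal Python sketch of that gate follows, assuming the score is the intersection-over-union of binary semantic and motion patch masks (as the Figure 5 caption suggests). The function names and the default threshold of 0.5 are hypothetical placeholders, not values taken from the paper.

```python
import numpy as np

def mask_iou(semantic_mask, motion_mask):
    """IoU between two boolean patch masks; returns 0.0 when both are empty."""
    s = np.asarray(semantic_mask, dtype=bool)
    m = np.asarray(motion_mask, dtype=bool)
    union = np.logical_or(s, m).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(s, m).sum() / union)

def select_mode(semantic_mask, motion_mask, theta_iou=0.5):
    """Pick the pruning mode from the alignment score.

    theta_iou=0.5 is an illustrative assumption, not the paper's setting.
    Conservative: IoU_t <= theta_iou. Aggressive: IoU_t > theta_iou.
    """
    iou = mask_iou(semantic_mask, motion_mask)
    mode = "aggressive" if iou > theta_iou else "conservative"
    return mode, iou
```

Under this sketch, a frame where the semantic and motion masks barely overlap stays in conservative mode, and the switch to aggressive mode happens only once the overlap exceeds the threshold.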

Figures (9)

Figure 1: Comparison of Perception-First vs. Interaction-First token pruning paradigms. Perception-First baselines (Top) prematurely lose the manipulation target due to early semantic misalignment, a vulnerability that simple temporal stacking (the Temporal branch) fails to resolve without explicit interaction modeling. In contrast, our VLA-IAP (Bottom) shifts to an Interaction-First approach. By coupling geometric priors with an IoU-aware dynamic strategy that transitions from conservative (background-only) to aggressive (full) pruning, VLA-IAP successfully preserves the physical target for precise execution.
Figure 2: Overview of the proposed interaction-aligned dynamic strategy for vision-language-action models. Given consecutive visual frames and a language instruction, a vision encoder extracts patch features, while three complementary priors are constructed: semantic prior $S$, motion prior $M$ (via Gaussian modeling, history accumulation, and morphology), and geometric prior $G$ (Sobel-based edge enhancement). The priors are projected and fused, and an IoU-based alignment score is computed to adaptively select background-only filtering or conservative/aggressive masking, producing a union mask $(S \cup M)$. After final token selection, the resulting visual tokens are fed into a VLA LLM/policy to generate the robot action.
Figure 3: Overview of the evaluation benchmarks and tasks. We evaluate on simulated benchmarks including LIBERO, VLABench, and CALVIN ABC-D, as well as real-world tasks covering simple, long-horizon, and dual-arm manipulation.
Figure 4: Real Robot Experiment Setup
Figure 5: Visualization of Interaction-aligned Pruning Process on LIBERO. (Bottom) Dynamic shift in visual token retention from conservative to aggressive mode. (Middle) The alignment score (IoU) regulating the pruning state. (Top) The overlap (purple) between semantic intent (red) and arm motion (blue) masks that drives the alignment score.
...and 4 more figures
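
The Figure 2 caption describes the geometric prior $G$ as Sobel-based edge enhancement. As a rough illustration only, the sketch below computes a Sobel gradient magnitude on a grayscale frame and averages it over non-overlapping patches to give one edge score per visual token. The patch size, the edge padding, and the mean pooling are assumptions made for this example, and the function names are hypothetical; the paper's actual procedure may differ.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale image using 3x3 Sobel kernels."""
    # Replicate border pixels so the output keeps the input shape (assumption).
    g = np.pad(np.asarray(gray, dtype=np.float64), 1, mode="edge")
    # Horizontal gradient: right column minus left column, rows weighted 1-2-1.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    # Vertical gradient: bottom row minus top row, columns weighted 1-2-1.
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return np.hypot(gx, gy)

def patch_edge_scores(gray, patch=16):
    """Mean edge magnitude per non-overlapping patch, shape (H//patch, W//patch).

    patch=16 is an illustrative value; rows/columns that do not fill a
    whole patch are dropped.
    """
    mag = sobel_magnitude(gray)
    h, w = mag.shape
    hp, wp = h // patch, w // patch
    mag = mag[:hp * patch, :wp * patch]
    return mag.reshape(hp, patch, wp, patch).mean(axis=(1, 3))
```

Patches with high scores contain strong intensity edges, which is one plausible way to flag visually sparse but structurally important regions so they are not pruned.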
