Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

Yifan Ye, Jiaqi Ma, Jun Cen, Zhihe Lu

TL;DR

The paper addresses the computational bottleneck of Vision-Language-Action models by introducing TEAM-VLA, a training-free token compression framework that densifies sparse foreground cues and applies mid-layer, action-guided merging to reduce tokens without retraining. It operates solely on the current frame, avoiding temporal buffering or learning updates, and combines a fast Token Expanding module with a soft bipartite Merging mechanism to preserve essential semantics. Evaluations on the LIBERO benchmark demonstrate substantial latency reductions (roughly 1.5x) while maintaining or improving task success compared to strong baselines, highlighting practical viability for real-time robotic control. Overall, TEAM-VLA delivers a practical, training-free pathway to deploy robust VLA systems in dynamic environments with limited computational resources.

Abstract

Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA

Paper Structure

This paper contains 32 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: A visual comparison of three foreground extraction strategies shows that our similarity-driven expansion achieves the most coherent foreground regions while maintaining superior efficiency and zero-buffer overhead.
  • Figure 2: The overall pipeline of TEAM consists of two stages. First, before tokens enter the language backbone, we perform token pruning. This begins with two complementary sampling steps: (1) similarity-based sampling and (2) context sampling, followed by a spatial expansion module that densifies the sparse similarity map. The expanded mask is then used to remove redundant visual tokens while preserving task-relevant regions. Second, at a middle layer of the backbone, we introduce an action-guided soft bipartite matching module that merges tokens through weighted averaging, effectively compressing deep representations while retaining essential semantic and action-related information.
  • Figure 3: We visualize the density distribution of the feature map $F$ to illustrate how attended regions aggregate spatially. On the binary mask, we apply a convolutional operation to obtain a density feature map $F$. We then expand the regions based on their density values, where the areas with the highest density are fully expanded to cover their corresponding spatial neighborhoods.
  • Figure 4: Panels (a)–(d) show visualizations of different tasks in the LIBERO-10 benchmark. For each task, the top row displays the sparse similarity mask, while the bottom row presents the corresponding expanded mask.
  • Figure 5: We further compare the number of remaining tokens across several mainstream methods, reporting for TEAM-VLA the average token count after the merging stage. As shown, TEAM-VLA retains substantially fewer tokens than other training-free (TF) approaches while achieving significantly higher performance.
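
The density-based expansion described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `density_map` and `expand_mask`, the box (all-ones) kernel, the kernel size `k`, and the rule that maps density to an expansion radius are all hypothetical choices made here for clarity. The sketch convolves a binary mask to obtain a density map $F$, then grows each selected cell by a radius proportional to its density, so that only the densest regions receive the full expansion.

```python
def density_map(mask, k=3):
    """Box-filter a binary mask with zero padding.

    F[i][j] is the fraction of selected cells inside the k x k window
    centred on (i, j). The all-ones kernel is an assumption.
    """
    h, w = len(mask), len(mask[0])
    r = k // 2
    F = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        total += mask[y][x]
            F[i][j] = total / (k * k)
    return F


def expand_mask(mask, k=3, max_radius=1):
    """Grow each selected cell by a density-dependent radius.

    Hypothetical rule: radius = floor(max_radius * F / max(F)), so cells at
    peak density expand by max_radius and sparse cells expand less or not at all.
    """
    h, w = len(mask), len(mask[0])
    F = density_map(mask, k)
    peak = max(max(row) for row in F)
    out = [row[:] for row in mask]
    if peak == 0:
        return out  # empty mask: nothing to expand
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            radius = int(max_radius * F[i][j] / peak)
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        out[y][x] = 1
    return out


# Toy 5 x 5 patch grid: a dense 2 x 2 cluster plus one isolated cell.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
expanded = expand_mask(mask, k=3, max_radius=1)
kept = sum(sum(row) for row in expanded)
```

In this toy grid the dense cluster grows into its neighborhood while the isolated cell is left unchanged, which mirrors the behavior the caption describes: expansion is concentrated where attended tokens aggregate spatially. In a token-pruning setting, the cells of `expanded` would index the visual tokens to keep.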