Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models

Yifan Ye; Jiaqi Ma; Jun Cen; Zhihe Lu

TL;DR

The paper addresses the computational bottleneck of Vision-Language-Action (VLA) models with TEAM-VLA, a training-free token compression framework that densifies sparse foreground cues and then applies mid-layer, action-guided merging to reduce the token count without retraining. It operates solely on the current frame, avoiding temporal buffering and parameter updates, and combines a fast Token Expanding module with a soft bipartite Merging mechanism to preserve essential semantics. Evaluations on the LIBERO benchmark show latency reductions of roughly 1.5x while maintaining or improving task success relative to strong baselines, which makes the approach a practical option for real-time robotic control in dynamic environments with limited computational resources.

Abstract

Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA.
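
The expand-then-merge idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`expand_tokens`, `merge_tokens`), the square neighbourhood with a `radius` parameter, and the nearest-neighbour averaging are all assumptions made for illustration. In particular, plain cosine-similarity averaging stands in for the paper's soft bipartite merging, and the action-aware guidance is omitted.

```python
import math

def expand_tokens(scores, grid_w, keep, radius=1):
    """Keep the `keep` highest-scoring patches, then add their grid neighbours.

    `scores` is a flat list of per-patch attention scores laid out row by row
    on a grid that is `grid_w` patches wide (hypothetical layout).
    """
    grid_h = len(scores) // grid_w
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    selected = set(top)
    for idx in top:
        r, c = divmod(idx, grid_w)
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < grid_h and 0 <= cc < grid_w:
                    selected.add(rr * grid_w + cc)
    return sorted(selected)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, kept):
    """Average every dropped token into its most similar kept token."""
    kept_set = set(kept)
    sums = {i: list(tokens[i]) for i in kept}
    counts = {i: 1 for i in kept}
    for j, tok in enumerate(tokens):
        if j in kept_set:
            continue
        best = max(kept, key=lambda i: cosine(tokens[i], tok))
        sums[best] = [s + t for s, t in zip(sums[best], tok)]
        counts[best] += 1
    return [[s / counts[i] for s in sums[i]] for i in kept]

# One attention peak on a 4x4 grid is expanded to its 3x3 neighbourhood.
scores = [0.0] * 16
scores[5] = 1.0
kept = expand_tokens(scores, grid_w=4, keep=1)

# Two kept tokens absorb the two dropped tokens closest to them in direction.
merged = merge_tokens([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]], kept=[0, 1])
```

In this sketch the expansion step trades a few extra tokens for spatial context around each attention peak, while the merge step shrinks the sequence without discarding the dropped tokens' content outright.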

Paper Structure

Table of Contents

Figures (5)