Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models

Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, Tao Chen

TL;DR

FlashVLA tackles the high inference cost of Vision-Language-Action (VLA) models by exploiting two forms of redundancy: high similarity between consecutive actions and redundancy among visual tokens. It is a training-free, plug-and-play framework that combines a token-aware action reuse mechanism with an information-contribution-based visual token pruning strategy that remains compatible with Flash Attention. On the LIBERO benchmark, it reduces FLOPs by 55.7% and latency by 36.0% with only a 0.7% drop in task success rate, without any retraining. This makes lightweight, low-latency VLA inference more practical for embodied agents deployed in edge settings.

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an information-guided visual token selection strategy that prunes low-contribution tokens. Extensive experiments on the LIBERO benchmark show that FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. These results demonstrate the effectiveness of FlashVLA in enabling lightweight, low-latency VLA inference without retraining.
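
Redundancy (i) can be made concrete by measuring how much the direction of the action vector changes from one step to the next. The sketch below is only an illustration: this section does not give the paper's exact metric, so the angle derived from cosine similarity, the function name `directional_difference`, and the handling of zero vectors are all assumptions.

```python
import math


def directional_difference(prev_action, curr_action):
    """Angle in radians between two action vectors (0.0 means same direction).

    Assumption: the angle from cosine similarity stands in for the
    'directional difference' between consecutive actions; the paper's
    exact definition is not stated in this section.
    """
    dot = sum(p * c for p, c in zip(prev_action, curr_action))
    norm_prev = math.sqrt(sum(p * p for p in prev_action))
    norm_curr = math.sqrt(sum(c * c for c in curr_action))
    if norm_prev == 0.0 or norm_curr == 0.0:
        # Assumption: a zero vector has no direction, so treat it as maximally different.
        return math.pi
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_sim = max(-1.0, min(1.0, dot / (norm_prev * norm_curr)))
    return math.acos(cos_sim)
```

A small angle between consecutive actions marks a stable step, which is the kind of step where reusing the previous action is plausible.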

Paper Structure

This paper contains 31 sections, 11 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Motivation behind our proposed FlashVLA. The figure shows the change in the VLA model’s output vector at each time step relative to the previous one. The vertical axis indicates the directional difference between consecutive actions. Most actions remain highly consistent with the previous step and appear in the stable area of the figure, while only a few exhibit significant changes.
  • Figure 2: Framework of FlashVLA, showing how the method proceeds as the action step changes. Before each inference, FlashTrigger decides, based on the action memory and the token memory, whether the output of the previous action can be reused (blue block). If the trigger condition is met, inference is skipped for this step. Otherwise, the step proceeds to pruned inference, in which the important visual tokens are selected during the prefill stage and the remaining tokens are pruned. After inference, the action and token information are used to update the action memory and the token memory.
  • Figure 3: Comparison of visual token selection strategies on a sample image. Left: patches selected using the proposed ICS. Right: patches selected uniformly at random. Patches selected by ICS tend to focus on semantically meaningful and information-dense regions.
  • Figure 4: FLOPs breakdown of FlashVLA across four LIBERO tasks under different visual token budgets. Each bar shows the cumulative reduction in FLOPs contributed by token pruning and computation reuse. FlashVLA consistently operates below the baseline FLOPs (dashed line), demonstrating the effectiveness of the dual-path acceleration strategy.
  • Figure 5: Action trajectories visualized in 3-dimensional space. The red dashed box marks the region where FlashVLA's trajectory is smoother for the same task.
  • ...and 4 more figures
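
The control flow described in the Figure 2 caption can be sketched as a small loop around the model. This is a minimal sketch under stated assumptions, not the authors' implementation: the trigger condition used here (cosine similarity of the last two actions and Jaccard overlap of the last two selected token sets, each against a threshold), the rule that a reused step is always followed by a real inference, the threshold values, and the callable `pruned_infer` are all hypothetical. Only the name FlashTrigger and the action-memory and token-memory roles come from the caption.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(s, t):
    """Overlap of two token-index sets; two empty sets count as identical."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0


class FlashTriggerSketch:
    """Hypothetical reuse trigger backed by an action memory and a token memory."""

    def __init__(self, pruned_infer, min_cos=0.99, min_overlap=0.75):
        # pruned_infer(observation) -> (action, kept_token_indices); hypothetical.
        self.pruned_infer = pruned_infer
        self.min_cos = min_cos          # assumed threshold on action similarity
        self.min_overlap = min_overlap  # assumed threshold on token-set overlap
        self.actions = deque(maxlen=2)     # action memory
        self.token_sets = deque(maxlen=2)  # token memory
        self.just_reused = False

    def should_reuse(self):
        # Assumption: never reuse twice in a row, and need two inferred steps first.
        if self.just_reused or len(self.actions) < 2:
            return False
        stable_action = cosine_similarity(*self.actions) >= self.min_cos
        stable_tokens = jaccard(*self.token_sets) >= self.min_overlap
        return stable_action and stable_tokens

    def step(self, observation):
        """Return (action, reused_flag) for one control step."""
        if self.should_reuse():
            self.just_reused = True
            return self.actions[-1], True  # skip inference, reuse the last action
        action, kept = self.pruned_infer(observation)
        self.actions.append(action)        # update action memory
        self.token_sets.append(set(kept))  # update token memory
        self.just_reused = False
        return action, False
```

With a policy whose outputs never change, this sketch runs inference on the first two steps and then alternates between reusing and inferring, so five steps cost three model calls.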