VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching

Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu

TL;DR

VLA-Cache addresses the real-time inference bottleneck of Vision-Language-Action (VLA) robotic systems with a training-free, cross-frame token caching mechanism. It reuses the cached representations of visually static tokens across adjacent frames, excludes task-relevant tokens from reuse (identified via decoder attention) so that they are recomputed, and applies a layer-adaptive reuse strategy informed by attention entropy to balance efficiency and accuracy. Evaluated on LIBERO, SIMPLER, and a Kinova Jaco2 real robot, it achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss in task success rate. The approach is a plug-in optimization that requires no retraining and complements existing high-frequency VLA architectures.

Abstract

Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby avoiding redundant computation. Additionally, to maintain action precision, VLA-Cache selectively recomputes task-relevant tokens that are environmentally sensitive, ensuring the fidelity of critical visual information. To further optimize efficiency, we introduce a layer-adaptive token reuse strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss in task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
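
The token selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine-similarity test, the threshold `sim_threshold`, and the `top_k_relevant` cutoff are all assumptions chosen for illustration. The idea shown is that a token is reusable only if its patch is nearly unchanged between adjacent frames and it is not among the most attended (task-relevant) tokens.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat patch vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_reusable_tokens(prev_patches, curr_patches, attn_scores,
                           sim_threshold=0.99, top_k_relevant=2):
    """Return indices of tokens whose cached key-value entries could be reused.

    sim_threshold and top_k_relevant are hypothetical parameters, not values
    taken from the paper.
    """
    # Static tokens: same-position patches that barely changed across frames.
    static = {
        i for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches))
        if cosine(prev, curr) >= sim_threshold
    }
    # Task-relevant tokens: the most attended tokens, always recomputed.
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    relevant = set(ranked[:top_k_relevant])
    return sorted(static - relevant)

# Toy example: four patches, only the last one changes between frames.
prev_patches = [[1, 0], [0, 1], [1, 1], [2, 1]]
curr_patches = [[1, 0], [0, 1], [1, 1], [1, 2]]
attn_scores = [0.1, 0.6, 0.2, 0.1]
reusable = select_reusable_tokens(prev_patches, curr_patches, attn_scores)
print(reusable)  # [0]
```

In the toy example, patches 0 to 2 are static, but patches 1 and 2 receive the most attention and are therefore recomputed, leaving only patch 0 for reuse.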

Paper Structure

This paper contains 45 sections, 18 equations, 8 figures, 11 tables, 2 algorithms.

Figures (8)

  • Figure 1: During the inference of the VLA model, static tokens of the input image remain largely consistent across steps. This consistency allows for caching the computations of these tokens from the previous step.
  • Figure 2: VLA-Cache accelerates the VLA's language decoding process across timesteps via the following two steps: (a) Dynamic Token Selection reuses static tokens across frames while preserving task-relevant ones; (b) Adaptive Token Caching dynamically adjusts reuse ratios per decoder layer based on attention patterns.
  • Figure 3: Tasks on LIBERO Benchmark, the SIMPLER Environment and Real World.
  • Figure 4: Visualization of VLA-Cache token reuse across settings. (a) LIBERO simulation with OpenVLA. (b) Real-world task under dynamic background. (c) and (d) Main and wrist camera views from OpenVLA-OFT. Blue: static tokens, Yellow: task-relevant, Red: overlapping. VLA-Cache reduces redundant computation and preserves accuracy under varying conditions.
  • Figure 5: VLA-Cache test results and attention heat map in a simulated environment.
  • ...and 3 more figures
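
The layer-adaptive step in Figure 2(b) adjusts the reuse ratio per decoder layer according to how concentrated the attention is. The sketch below illustrates one possible mapping and should not be read as the paper's formula: it measures concentration as one minus the normalized Shannon entropy of a layer's attention distribution and scales a hypothetical `max_ratio` by that value. All names and the mapping itself are assumptions.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of a non-negative attention vector."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def layer_reuse_ratios(per_layer_attention, max_ratio=1.0):
    """Map each layer's attention concentration to a reuse ratio in [0, max_ratio].

    Assumed mapping (not from the paper): uniform attention gives ratio 0,
    fully concentrated attention gives max_ratio.
    """
    ratios = []
    for weights in per_layer_attention:
        if len(weights) < 2:
            # Entropy is undefined as a concentration measure for one token.
            ratios.append(0.0)
            continue
        max_entropy = math.log(len(weights))
        concentration = 1.0 - attention_entropy(weights) / max_entropy
        ratios.append(max_ratio * concentration)
    return ratios

# Toy example: three layers with increasingly concentrated attention.
per_layer_attention = [
    [0.25, 0.25, 0.25, 0.25],  # uniform
    [0.7, 0.1, 0.1, 0.1],      # partially concentrated
    [1.0, 0.0, 0.0, 0.0],      # fully concentrated
]
ratios = layer_reuse_ratios(per_layer_attention)
print([round(r, 3) for r in ratios])
```

Under this assumed mapping, layers whose attention is focused on a few tokens reuse more cached tokens, while layers with diffuse attention recompute more.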