Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

Weikang Qiu, Tinglin Huang, Aosong Feng, Rex Ying

TL;DR

Vision-Language-Action (VLA) models struggle with long-horizon context and high inference cost, partly because transformer attention scales quadratically with context length. The paper introduces SD-VLA, which disentangles visual inputs into multi-level static tokens and dynamic tokens. This lets the model keep a single copy of the static tokens across frames and reuse their cached representations, refreshing them only when a trainable recache gate decides it is necessary; both steps shorten the context and accelerate rollouts. The paper also proposes LIBERO-Memory, a benchmark designed to evaluate temporal dependency modeling in VLAs. Empirically, SD-VLA delivers a 39.8% absolute improvement in success rate on LIBERO-Memory and up to a 2.26× inference speedup on SimplerEnv, indicating improved long-horizon reasoning with practical efficiency gains for robotic control.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference due to the quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory remains static across timesteps (e.g., the background). Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs. Experimental results show that our approach outperforms baselines on this benchmark by 39.8% absolute improvement in success rate, and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on the same benchmark, enabling faster and more practical real-world deployment.
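The context-length saving from keeping a single copy of the static tokens can be illustrated with a small calculation. The sketch below is not taken from the paper: the function name `context_tokens` and the token counts are hypothetical, chosen only to show the arithmetic, and the quadratic estimate ignores language tokens, feed-forward layers, and all other costs.

```python
def context_tokens(num_frames, static_per_frame, dynamic_per_frame, share_static):
    """Count visual tokens in the context for a window of frames.

    With share_static=True, one copy of the static tokens is kept for the
    whole window; otherwise every frame carries its own static tokens.
    """
    if share_static:
        return static_per_frame + num_frames * dynamic_per_frame
    return num_frames * (static_per_frame + dynamic_per_frame)


# Hypothetical token counts, for illustration only (not from the paper).
frames, n_static, n_dynamic = 8, 192, 64

naive = context_tokens(frames, n_static, n_dynamic, share_static=False)
shared = context_tokens(frames, n_static, n_dynamic, share_static=True)

# Rough estimate only: self-attention cost grows with the square of the
# context length, so the attention saving is the squared length ratio.
attention_ratio = (naive / shared) ** 2

print(f"naive={naive} shared={shared} attention cost ratio={attention_ratio:.2f}")
```

Under these assumed counts the shared layout holds 704 visual tokens instead of 2048. The actual split between static and dynamic tokens in SD-VLA is learned, so real savings depend on the model and the scene.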

Paper Structure (25 sections, 9 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Our model addresses two main challenges in existing VLAs with the proposed static-dynamic disentanglement. 1) By keeping one copy of static tokens across all timesteps, our model can fit observations from multiple steps into its context; 2) by placing static tokens before all dynamic tokens, our model can reuse the KV cache of previous timesteps during rollouts.
  • Figure 2: Some existing methods (xu2025vla; liu2025ttf) that reuse information from previous frames overlook that static patches are still altered by attention, even when they are identical in the original pixel space. In contrast, our model learns to disentangle static tokens and places them before the dynamic tokens, so that, under the causal attention mechanism of the LLM backbone, the static tokens are not affected by the tokens that follow them.
  • Figure 3: (a) Model architecture overview. We illustrate the design using two levels of static cache. At each level, a recache gate determines whether the cached static tokens should be reused or refreshed. If the L1 cache is refreshed, the L2 cache is also forcibly refreshed. (b) Contrastive loss used to train static tokens to be temporally persistent. Observations from the same trajectory form positive pairs, while observations from different trajectories form negative pairs.
  • Figure 4: An example of the procedure of the proposed benchmark.
  • Figure 5: Attention map visualization across time. For each token, we compute its last-layer attention to image patches and upsample the result to the full image resolution to produce a heatmap. Heatmaps are averaged over tokens of the same type and displayed by row (e.g., the Dynamic row shows the average attention heatmap of all dynamic tokens).
  • ...and 3 more figures
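
The two-level reuse-or-refresh rule described in the Figure 3 caption can be sketched as plain control flow. This is a minimal illustration under stated assumptions, not the paper's implementation: `StaticCache`, `update_cache`, the fixed `threshold`, and the string placeholders are all hypothetical. In SD-VLA the gate is trainable and the cached items are key-value tensors; here the gate scores are simply passed in.

```python
from dataclasses import dataclass, field


@dataclass
class StaticCache:
    """Cached entries for two levels of static tokens (hypothetical layout)."""
    levels: list = field(default_factory=lambda: [None, None])


def update_cache(cache, gate_scores, fresh_entries, threshold=0.5):
    """Reuse or refresh each cache level for one timestep.

    gate_scores: per-level scores in [0, 1]; above `threshold` requests a refresh.
    fresh_entries: per-level values that a refresh would store.
    Returns a list of booleans, True where that level was refreshed.
    """
    refreshed = []
    force = False  # set once an earlier level refreshes
    for level, (score, fresh) in enumerate(zip(gate_scores, fresh_entries)):
        is_empty = cache.levels[level] is None
        refresh = force or is_empty or score > threshold
        if refresh:
            cache.levels[level] = fresh
            force = True  # an L1 refresh forces an L2 refresh
        refreshed.append(refresh)
    return refreshed


cache = StaticCache()
step0 = update_cache(cache, [0.1, 0.1], ["L1@t0", "L2@t0"])  # empty cache: fill both
step1 = update_cache(cache, [0.1, 0.9], ["L1@t1", "L2@t1"])  # only L2 gate fires
step2 = update_cache(cache, [0.9, 0.1], ["L1@t2", "L2@t2"])  # L1 fires, L2 is forced
step3 = update_cache(cache, [0.2, 0.3], ["L1@t3", "L2@t3"])  # both levels reused
print(step0, step1, step2, step3, cache.levels)
```

The one-directional `force` flag encodes the caption's rule that a refresh at L1 propagates to L2 while an L2 refresh leaves L1 untouched. That ordering is consistent with causal attention, where later tokens depend on earlier ones but not the reverse.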