Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement
Weikang Qiu, Tinglin Huang, Aosong Feng, Rex Ying
TL;DR
Vision-Language-Action (VLA) models struggle with long-horizon context and high inference cost, in part because of the quadratic attention in transformers. The paper introduces SD-VLA, which disentangles visual inputs into multi-level static tokens and dynamic tokens. This allows a single copy of the static tokens to be kept across frames and their cached representations to be reused, with a trainable recache gate deciding when the cache is refreshed; together these reduce context length and accelerate rollouts. The paper also proposes LIBERO-Memory, a benchmark designed to evaluate true temporal dependency modeling in VLAs. Empirically, SD-VLA delivers a 39.8% absolute improvement in success rate on LIBERO-Memory and up to a 2.26× inference speedup on SimplerEnv, showing improved long-horizon reasoning with practical efficiency gains for robotic control.
Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: limited long-horizon context and inefficient inference, both caused by quadratic attention complexity and large parameter counts. Our work is motivated by the observation that much of the visual information in a trajectory (e.g., the background) remains static across timesteps. Leveraging this property, we propose SD-VLA, a framework that disentangles visual inputs into multi-level static and dynamic tokens. This enables (1) retaining a single copy of static tokens across frames, which significantly reduces context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates the cache only when necessary. Together, these make both multi-frame integration and inference efficient. In addition, we introduce a new benchmark that more effectively evaluates the long-horizon temporal dependency modeling ability of VLAs. Experimental results show that our approach outperforms baselines on this benchmark by 39.8% in absolute success rate and achieves a 3.9% gain on the SimplerEnv benchmark. Moreover, SD-VLA delivers a 2.26x inference speedup over the base VLA model on SimplerEnv, enabling faster and more practical real-world deployment.
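The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The token counts, the helper names (`context_length`, `attention_cost`, `StaticKVCache`), and the fixed change threshold are all assumptions made for illustration; in particular, the paper's recache gate is learned, whereas this sketch substitutes a simple hand-set threshold.

```python
def context_length(num_frames, static_tokens, dynamic_tokens, share_static):
    """Count visual tokens in the context for a window of frames."""
    if share_static:
        # One shared copy of the static tokens, plus dynamic tokens per frame.
        return static_tokens + num_frames * dynamic_tokens
    # Baseline: every frame contributes its full set of tokens.
    return num_frames * (static_tokens + dynamic_tokens)


def attention_cost(num_tokens):
    """Self-attention compares every token pair, so cost grows as n^2."""
    return num_tokens * num_tokens


class StaticKVCache:
    """Reuse an encoded static representation until the input drifts.

    Assumption: a fixed threshold stands in for the trainable recache gate.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.reference = None   # static features at the last recompute
        self.cached = None      # encoded result being reused
        self.recomputes = 0

    def _change(self, features):
        # Largest per-element difference from the reference features.
        return max(abs(a - b) for a, b in zip(features, self.reference))

    def get(self, features, encode):
        stale = self.reference is None or self._change(features) > self.threshold
        if stale:
            self.cached = encode(features)
            self.reference = list(features)
            self.recomputes += 1
        return self.cached


# Hypothetical sizes: 8 frames, 192 static and 64 dynamic tokens per frame.
baseline = context_length(8, 192, 64, share_static=False)
shared = context_length(8, 192, 64, share_static=True)
print(baseline, shared)  # 2048 704
print(attention_cost(baseline) / attention_cost(shared))

cache = StaticKVCache(threshold=0.1)
encode = lambda feats: [2 * x for x in feats]  # placeholder for KV computation
cache.get([0.0, 0.0], encode)    # first call always computes
cache.get([0.01, 0.0], encode)   # small change: cached result is reused
cache.get([0.5, 0.0], encode)    # large change: cache is refreshed
print(cache.recomputes)  # 2
```

With these illustrative numbers, sharing the static tokens cuts the context from 2048 to 704 tokens, and because attention cost scales quadratically the saving in pairwise interactions is larger than the saving in token count. The measured speedups reported in the paper depend on the actual model and are not reproduced by this arithmetic.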
