ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context
Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, Jinwoo Shin
TL;DR
Robotic policies often require temporal context, but multi-frame training can degrade performance and incur high compute costs. ContextVLA introduces an amortized context token inside a Vision-Language Model backbone to summarize past observations into a single token, enabling efficient multi-frame action generation that supports autoregressive or diffusion decoders. The approach yields consistent improvements over single-frame VLAs and matches the benefits of full multi-frame training with substantially reduced training and inference time, demonstrated on Libero, Simpler-WidowX, Robocasa, and real-world tasks. This work advances practical generalist robot policies by efficiently encoding temporal context, with strong implications for real-time robotics and deployment feasibility.
Abstract
Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action models (VLA), i.e., policy models built upon a Vision-Language Model (VLM), more effectively utilize multi-frame observations for action generation. This suggests that VLMs' inherent temporal understanding capability enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training but with reduced training and inference times.
