ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, Jinwoo Shin

TL;DR

Robotic policies often require temporal context, but multi-frame training can degrade performance and incur high compute costs. ContextVLA introduces an amortized context token inside a Vision-Language Model backbone that summarizes past observations into a single token, enabling efficient multi-frame action generation with either autoregressive or diffusion decoders. The approach yields consistent improvements over single-frame VLAs and matches the benefits of full multi-frame training with reduced training and inference time, demonstrated on Libero, Simpler-WidowX, Robocasa, and real-world tasks. By encoding temporal context cheaply, the method makes multi-frame generalist robot policies more practical to train and to deploy under latency constraints.

Abstract

Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action models (VLA), i.e., policy models built upon a Vision-Language Model (VLM), more effectively utilize multi-frame observations for action generation. This suggests that VLMs' inherent temporal understanding capability enables them to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training but with reduced training and inference times.
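The compression step described in the abstract can be illustrated with a minimal sketch. It assumes each frame has already been encoded into a list of feature vectors and that the past-frame tokens are merged by simple averaging; the pooling operator, the function names `mean_pool` and `amortize_context`, and the toy shapes are assumptions made for illustration, not details confirmed by the text above.

```python
from typing import List

Token = List[float]  # one feature vector (hypothetical representation)


def mean_pool(tokens: List[Token]) -> Token:
    """Average a non-empty list of equal-length feature vectors into one vector."""
    if not tokens:
        raise ValueError("need at least one token to pool")
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def amortize_context(frames: List[List[Token]]) -> List[Token]:
    """Replace all past-frame tokens with a single context token.

    frames[i] holds the tokens of one observation, ordered oldest to newest;
    the last entry is the current observation. Returns the context token
    followed by the untouched tokens of the current observation.
    Averaging is an assumed stand-in for the compression operator.
    """
    *past, current = frames
    if not past:
        # Single-frame input: nothing to compress.
        return list(current)
    past_tokens = [tok for frame in past for tok in frame]
    context_token = mean_pool(past_tokens)
    return [context_token] + list(current)


# Toy example: two past frames and one current frame, two tokens each.
frames = [
    [[1.0, 2.0], [3.0, 4.0]],  # oldest past observation
    [[5.0, 6.0], [7.0, 8.0]],  # most recent past observation
    [[9.0, 9.0], [0.0, 0.0]],  # current observation
]
compressed = amortize_context(frames)
print(len(compressed))  # 3 tokens instead of the original 6
```

The point of the sketch is only the change in sequence length: the downstream layers see one token for the whole history plus the current frame's tokens, regardless of how many past frames were provided.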

Paper Structure

This paper contains 50 sections, 2 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview. (a) Many robotic tasks require temporal context to generate accurate actions. (b) By leveraging multi-frame observations, our proposed method, ContextVLA, achieves a higher average success rate (%) than all baseline policies on real-world robotic tasks. (c) Moreover, our framework gains the benefits of multi-frame training with reduced inference latency.
  • Figure 2: Effect of multi-frame observations on training various policy models. We report the success rates (%) of various policy models fine-tuned on the Square task from the Robomimic benchmark (Mandlekar et al., 2021). (a) When policy models are trained on multi-frame observations, a traditional policy model (Diffusion Policy) shows significant performance degradation, whereas recent Vision-Language-Action models (VLAs; $\pi_0$ and GR00T N1.5) do not. (b) We find that the key factor in overcoming this problem is leveraging a pre-trained Vision-Language Model (VLM) to extract temporal information for action generation. ViT, VLM, and VLA-init indicate how the VLA architecture is initialized for training: we use a pre-trained vision encoder, VLM, or VLA, respectively, and all other parameters are randomly initialized.
  • Figure 3: Overview of ContextVLA. We design an efficient Vision-Language-Action model (VLA) that generates actions using multi-frame visual observations. We use a Vision-Language Model (VLM) to encode observations ${\mathbf{o}}_{t-k:t}$, where we compress past observations ${\mathbf{o}}_{t-k:t-1}$ into a single context token ${\mathbf{m}}$ at the VLM block $n$. We then leverage the VLM features to generate actions via either autoregressive modeling or diffusion-based modeling.
  • Figure 4: Examples of visual observations from the evaluation tasks. (a) We consider simulated robotic manipulation tasks from Libero (Liu et al., 2023), Simpler-WidowX (Li et al., 2024), and Robocasa (Nasiriany et al., 2024). (b) We design real-world robotic tasks: clench/unclench hand (Clench/Unclench), pick-and-place twice (PnP Twice), and cover and stack (CoverNStack).
  • Figure 5: Training efficiency. We report the wall-clock time of fine-tuning $\pi_0$ on Libero (Liu et al., 2023) using 4 NVIDIA A100 80GB GPUs.
  • ...and 3 more figures
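
To make the efficiency argument behind the Figure 3 caption concrete, the following sketch counts how many visual tokens each VLM block would process when the past observations are compressed at block $n$. It assumes that blocks with index below $n$ see all $k+1$ frames and that every later block sees one context token plus the current frame; the function name, the 0-based block indexing, and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import List


def visual_tokens_per_block(
    num_blocks: int,
    compress_at: int,
    past_frames: int,
    tokens_per_frame: int,
) -> List[int]:
    """Return the number of visual tokens seen by each VLM block.

    Assumption: blocks with index < compress_at process all past frames plus
    the current frame; from compress_at onward, the past frames are replaced
    by a single context token.
    """
    if not 0 <= compress_at <= num_blocks:
        raise ValueError("compress_at must lie in [0, num_blocks]")
    full = (past_frames + 1) * tokens_per_frame
    compressed = 1 + tokens_per_frame
    return [full if b < compress_at else compressed for b in range(num_blocks)]


# Hypothetical configuration: 4 blocks, compression at block 1,
# 2 past frames, 10 visual tokens per frame.
counts = visual_tokens_per_block(
    num_blocks=4, compress_at=1, past_frames=2, tokens_per_frame=10
)
print(counts)       # [30, 11, 11, 11]
print(sum(counts))  # 63, versus 4 * 30 = 120 without compression
```

Under these assumptions, only the first $n$ blocks pay the full multi-frame cost, which is consistent with the reduced training and inference time reported for the method.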