CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang

TL;DR

CronusVLA presents a two-stage framework to efficiently extend single-frame vision-language-action models to multi-frame manipulation tasks. It first pretrains on large embodied datasets, then performs post-training to convert discrete action tokens into learnable features with a cross-frame decoder and feature chunking, enabling fast inference. The approach achieves state-of-the-art results on SimplerEnv and LIBERO benchmarks and demonstrates strong robustness on the novel SimplerEnv-OR disturbance suite, with validated real-world performance on a Franka robot. Collectively, CronusVLA offers a practical pathway to robust, long-horizon robotic manipulation with efficient multi-frame temporal modeling.

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
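The feature-chunking idea in stage (2) can be illustrated with a minimal sketch. It assumes a fixed-length first-in-first-out cache of per-frame features, so the expensive backbone runs only once per new frame while earlier features are reused. The names `FeatureQueue`, `toy_encode`, and `toy_decode` are hypothetical, and the mean-pooling decoder is only a stand-in for the paper's learned cross-frame decoder, not the authors' implementation.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class FeatureQueue:
    """FIFO cache of per-frame features (hypothetical helper, not from the paper)."""

    def __init__(self, encode: Callable[[Sequence[float]], Feature], num_frames: int) -> None:
        if num_frames < 1:
            raise ValueError("num_frames must be >= 1")
        self.encode = encode
        # deque(maxlen=...) silently drops the oldest feature once it is full.
        self.features: Deque[Feature] = deque(maxlen=num_frames)
        self.encode_calls = 0

    def step(self, frame: Sequence[float]) -> List[Feature]:
        """Encode only the newest frame and return the current feature chunk."""
        self.features.append(self.encode(frame))
        self.encode_calls += 1
        return list(self.features)


def toy_encode(frame: Sequence[float]) -> Feature:
    # Stand-in for the vision-language backbone: two summary statistics.
    return [sum(frame) / len(frame), max(frame)]


def toy_decode(chunk: List[Feature]) -> Feature:
    # Stand-in for the cross-frame decoder: mean-pool features across frames.
    n = len(chunk)
    return [sum(f[i] for f in chunk) / n for i in range(len(chunk[0]))]


queue = FeatureQueue(toy_encode, num_frames=3)
actions = []
for frame in ([0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]):
    chunk = queue.step(frame)
    actions.append(toy_decode(chunk))
```

With four incoming frames and a three-frame window, the toy backbone is called four times in total (once per step), whereas re-encoding the whole window at every step would cost up to three calls per step.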

Paper Structure

This paper contains 41 sections, 6 equations, 24 figures, 21 tables.

Figures (24)

  • Figure 1: CronusVLA is a multi-frame modeling framework that includes single-frame pretraining on large-scale manipulation datasets and multi-frame post-training on cross-embodiment datasets. CronusVLA shows fast inference, high performance in simulation benchmarks and real-world experiments, and better observational robustness.
  • Figure 2: Overview of the CronusVLA framework. (a) illustrates the single-frame pretraining of the basic single-frame VLA. By duplicating the model weights, we perform multi-frame post-training as shown in (b), where multi-frame modeling is achieved by aggregating learnable features from several preceding frames in a cross-frame decoder. In (c), a queue mechanism is applied to feature chunking for fast inference. Details of the cross-frame decoder are illustrated in (d).
  • Figure 3: An illustration of the SimplerEnv-OR benchmark.
  • Figure 4: Real-world experiment. (a) evaluates basic pick-and-place capabilities; the long-horizon tasks in (b) demonstrate the advantages of multi-frame modeling in handling temporally dependent manipulations; and the generalization and robustness tests in (c), particularly under camera occlusion and various disturbances, highlight the robustness of our model.
  • Figure 5: Impact of frame number. Varying the number of input frames affects the success rate and inference speed.
  • ...and 19 more figures