AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models

Xiaoquan Sun, Zetian Xu, Chen Cao, Zonghe Liu, Yihan Sun, Jingrui Pang, Ruijian Zhang, Zhen Yang, Kang Pang, Dingxin He, Mingqi Yuan, Jiayu Chen

TL;DR

We propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. It uses a large language model to decompose high-level demonstrations into fine-grained atomic subtasks and enables efficient Group Relative Policy Optimization (GRPO) without the prohibitive cost of online rollouts on physical robots.

Abstract

Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. Robust instruction grounding, a critical component of effective control, can improve how VLA models execute complex multi-step behaviors. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning (SFT). This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Bridging this gap and providing scalable post-training for VLA models is therefore urgent. To tackle this problem, we propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework uses a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. It then uses a pretrained predictive world model to score candidate action chunks against subtask goals in latent space, which mitigates error accumulation and significantly improves long-horizon robustness. This design also enables efficient Group Relative Policy Optimization (GRPO) without the prohibitive cost of online rollouts on physical robots. Extensive simulations show that AtomVLA remains robust under perturbations. Compared against baseline models, it achieves an average success rate of 97.0% on the LIBERO benchmark and 48.0% on the LIBERO-PRO benchmark. Finally, real-world experiments on the Galaxea R1 Lite platform confirm its applicability across diverse tasks, especially long-horizon ones. All datasets, checkpoints, and code will be released publicly following the acceptance of this work to support future research.
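To make the post-training idea concrete, the following is a minimal sketch of how a group of candidate action chunks could be scored and turned into group-relative advantages. It is not the authors' implementation. The reward (negative Euclidean distance between a predicted latent and a subtask goal latent), the function names, and the toy vectors are all assumptions made for illustration; the abstract only states that a world model scores candidates against subtask goals in latent space. The advantage step follows the usual GRPO recipe of standardizing rewards within a group, here with the population standard deviation.

```python
import math
from statistics import mean, pstdev


def latent_reward(predicted_latent, goal_latent):
    """Assumed reward: negative Euclidean distance to the subtask goal latent.

    In a full pipeline `predicted_latent` would come from a world model
    rolled forward under one candidate action chunk; here it is given.
    """
    return -math.dist(predicted_latent, goal_latent)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of candidates (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three candidate action chunks for the same observation.
goal = [1.0, 0.0]                       # hypothetical subtask goal latent
predicted = [[1.0, 0.0],                # candidate 0 lands on the goal
             [0.0, 0.0],                # candidate 1 is 1.0 away
             [1.0, 2.0]]                # candidate 2 is 2.0 away

rewards = [latent_reward(p, goal) for p in predicted]
advantages = group_relative_advantages(rewards)

for i, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"candidate {i}: reward={r:.3f} advantage={a:.3f}")
```

Because advantages are computed relative to the other candidates in the same group, no learned value function is needed, and because the rewards come from a world model rather than from executing actions, no physical rollout is required for this step.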

Paper Structure (22 sections, 6 equations, 7 figures, 10 tables)

This paper contains 22 sections, 6 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Framework of AtomVLA. We propose a scalable two-stage framework for robotic manipulation. Left (Stage I): High-level instructions are decomposed into subtask instructions using a large language model (GPT-4o). These subtask instructions are then combined with the original high-level instruction to guide SFT of the model. Middle (Stage II): A predictive latent world model evaluates candidate action rollouts to provide rewards for offline post-training via GRPO. Right & Bottom: AtomVLA achieves 97% and 48% success rates on the LIBERO and LIBERO-PRO benchmarks and demonstrates strong generalization in real-world experiments.
  • Figure 2: (a) Typical VLA models rely on SFT. (b) AtomVLA (Ours) leverages a language model for fine-grained decomposition into atomic subtask instructions and a world model for RL post-training.
  • Figure 3: Training pipeline. Stage I: High-level instructions are decomposed into fine-grained atomic subtask instructions using an LLM (GPT-4o). These subtask instructions are then combined with the original high-level instruction to guide SFT of the model. Stage II: A predictive latent world model evaluates candidate action rollouts to provide rewards for offline post-training.
  • Figure 4: Visualization of real-world tasks. The top two rows illustrate basic tasks to stack bowls, place fruit into a basket, hang the cup, and open the drawer. The bottom two rows demonstrate hard, long-horizon tasks to fold a T-shirt and a towel.
  • Figure 5: Real-world experimental setup. (a) Tabletop workspace. (b) Galaxea R1 Lite platform.
  • ...and 2 more figures
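
The Stage I captions above describe pairing each LLM-generated subtask instruction with the original high-level instruction during SFT. The sketch below shows one way such paired inputs might be represented. The prompt template, field names, frame boundaries, and the example decomposition are all hypothetical and chosen only for illustration; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class SubtaskSegment:
    high_level: str  # original task instruction
    subtask: str     # atomic subtask from the LLM decomposition
    start: int       # first frame of the segment (hypothetical field)
    end: int         # last frame of the segment, inclusive (hypothetical field)


def build_prompt(segment: SubtaskSegment) -> str:
    """Combine both instruction levels into one text input (assumed template)."""
    return f"Task: {segment.high_level}\nSubtask: {segment.subtask}"


def segment_at(segments, frame: int) -> SubtaskSegment:
    """Return the segment whose frame range contains `frame`."""
    for seg in segments:
        if seg.start <= frame <= seg.end:
            return seg
    raise ValueError(f"frame {frame} is not covered by any segment")


# Illustrative decomposition of one demonstration (invented, not from the paper).
task = "put the bowl on the plate"
demo = [
    SubtaskSegment(task, "reach the bowl", 0, 19),
    SubtaskSegment(task, "grasp the bowl", 20, 39),
    SubtaskSegment(task, "move the bowl above the plate", 40, 69),
    SubtaskSegment(task, "release the bowl", 70, 89),
]

prompt = build_prompt(segment_at(demo, 25))
print(prompt)
```

Keeping the high-level instruction in every prompt preserves the overall goal, while the subtask line supplies the intermediate guidance that coarse instructions lack.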