AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

Likui Zhang; Tao Tang; Zhihao Zhan; Xiuwei Chen; Zisheng Chen; Jianhua Han; Jiangtong Zhu; Pei Xu; Hang Xu; Hefeng Wu; Liang Lin; Xiaodan Liang

TL;DR

This work proposes AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions, and introduces a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning.

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require continual acquisition of new skills, which goes beyond executing single actions or skills. These requirements are significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data and therefore scale poorly. To address this, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in a generic yet precise atomic skill. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms $\pi_{0}$ by 2.4% on LIBERO and 10% on LIBERO-LONG, and exceeds $\pi_{0}$ and $\pi_{0.5}$ by 0.22 and 0.25, respectively, in average task length on CALVIN. Additionally, AtomicVLA consistently surpasses baselines by 18.3% and 21% on real-world long-horizon and continual-learning tasks, respectively. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is https://zhanglk9.github.io/atomicvla-web/.
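The skill-guided routing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `SkillGuidedMoE`, the helper `make_linear_expert`, the skill names, and the choice to sum the shared-expert and skill-expert outputs are all assumptions made for illustration. It only shows the general pattern of selecting one skill expert per action token and always applying a shared expert.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_linear_expert(scale: float, bias: float) -> Expert:
    """Hypothetical stand-in for a learned expert network: y = scale * x + bias."""
    def expert(x: Vector) -> Vector:
        return [scale * v + bias for v in x]
    return expert


class SkillGuidedMoE:
    """Toy layer with one shared expert and one expert per atomic skill."""

    def __init__(self, shared: Expert, skill_experts: Dict[str, Expert]) -> None:
        self.shared = shared
        self.skill_experts = dict(skill_experts)

    def route(self, skill_scores: Dict[str, float]) -> str:
        """Top-1 routing: return the registered skill with the highest score."""
        candidates = {k: v for k, v in skill_scores.items() if k in self.skill_experts}
        if not candidates:
            raise ValueError("no registered expert for any scored skill")
        return max(candidates, key=lambda k: candidates[k])

    def forward(self, action_token: Vector, skill_scores: Dict[str, float]) -> Vector:
        skill = self.route(skill_scores)
        shared_out = self.shared(action_token)
        skill_out = self.skill_experts[skill](action_token)
        # Assumption: the two expert outputs are combined by elementwise sum.
        return [a + b for a, b in zip(shared_out, skill_out)]


moe = SkillGuidedMoE(
    shared=make_linear_expert(1.0, 0.0),
    skill_experts={
        "pick": make_linear_expert(2.0, 0.0),   # hypothetical skill names
        "place": make_linear_expert(0.0, 1.0),
    },
)
picked = moe.forward([1.0, 2.0], {"pick": 0.9, "place": 0.1})
placed = moe.forward([1.0, 2.0], {"pick": 0.2, "place": 0.8})
```

In this sketch the skill scores stand in for the atomic skill abstraction predicted upstream; only the selected skill expert runs, so adding more experts does not change the per-token cost.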

Paper Structure (27 sections, 3 equations, 11 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Vision-Language-Action Models
Multimodal Mixture-of-Experts
Continual Learning with Skill Abstractions
Method
Overview
Unified Task Planning and Action Execution
Skill-guided Mixture of Experts Architecture
Continual Learning with Skill Expansion
Task Planning Embodied Data Generation
Experiments
Experiments Setup
Results on Simulation
Results on Real-world Robot
...and 12 more sections

Figures (11)

Figure 1: Overview of AtomicVLA. Unlike previous VLA models with a single action head, which suffer from limited scalability and severe interference among mixed skills, AtomicVLA employs an SG-MoE architecture to build a scalable skill expert library. By unifying task planning and action execution within this framework, it achieves strong performance on long-horizon and continual learning tasks in both simulation and real-world settings.
Figure 2: (a) AtomicVLA Pipeline. AtomicVLA is a framework that unifies task planning and action execution. The VLM adaptively predicts the atomic skill abstraction and the latent action. The Action Decoder in the SG-MoE architecture receives both the latent action and the newly inferred atomic skill abstraction, and generates fine-grained motor actions. (b) Skill-Guided Mixture of Experts. SG-MoE includes a skill router, a shared expert, and multiple atomic-skill experts. The router selects the top skill expert based on the atomic skill, and the action token is processed by both the activated skill expert and the shared expert. (c) Continual Learning with Skill Expansion. New skills are added by training only the new expert and extending the router. (d) Task Planning Embodied Data Generation. High-quality embodied reasoning data are generated using principal-axis analysis with the InternVideo2.5 model.
Figure 3: Inference Example of AtomicVLA. We visualize two tasks from LIBERO-LONG. For each task, the top row shows the task progression, and the bottom row shows AtomicVLA’s inferred outputs. Gray blocks denote Thinking, while colored blocks indicate Acting, with colors corresponding to the activated skill experts. The left column shows the initial task state (top) and the skill-expert activation during inference (bottom).
Figure 4: Error Recovery Capability Demonstration. When encountering a skill execution failure, AtomicVLA automatically assesses the progress and re-executes the current skill.
Figure 5: Demonstrations of the execution process of the baseline $\pi_{0.5}$ (first row) and AtomicVLA* (second row).
...and 6 more figures
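
The caption of Figure 2(c) says that new skills are added by training only the new expert and extending the router. A minimal sketch of that bookkeeping follows, under stated assumptions: `Expert`, `SkillLibrary`, `add_skill`, and `train_step` are hypothetical names, each expert is reduced to a single scalar weight, and the update is plain gradient descent. It is meant to show the freeze-and-extend pattern, not the paper's training procedure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Expert:
    name: str
    weight: float = 0.0      # toy single-parameter expert (assumption)
    trainable: bool = True


@dataclass
class SkillLibrary:
    experts: Dict[str, Expert] = field(default_factory=dict)
    router_classes: List[str] = field(default_factory=list)

    def add_skill(self, name: str) -> Expert:
        """Freeze all existing experts, then register one new trainable expert."""
        if name in self.experts:
            raise ValueError(f"skill {name!r} already has an expert")
        for expert in self.experts.values():
            expert.trainable = False
        new_expert = Expert(name)
        self.experts[name] = new_expert
        # Extending the router here means adding one more output class.
        self.router_classes.append(name)
        return new_expert

    def train_step(self, grads: Dict[str, float], lr: float = 0.1) -> None:
        """Apply gradient descent only to experts that are still trainable."""
        for name, grad in grads.items():
            expert = self.experts[name]
            if expert.trainable:
                expert.weight -= lr * grad


library = SkillLibrary()
library.add_skill("pick")                      # hypothetical skill names
library.train_step({"pick": 1.0})
library.add_skill("pour")                      # "pick" is frozen from here on
library.train_step({"pick": 1.0, "pour": 2.0})
```

After the second `train_step`, the gradient for the frozen "pick" expert is ignored, so previously learned skills are left untouched while the new expert is trained.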
