Learning Semantic Atomic Skills for Multi-Task Robotic Manipulation
Yihang Zhu, Weiqing Wang, Shijie Wu, Ye Shi, Jingya Wang
TL;DR
AtomSkill is a semantic atomic-skill framework for multi-task robotic manipulation. It segments demonstrations into variable-length, semantically coherent skills and annotates them with a vision-language model. It couples a VQ-VAE-style latent skill space with a diffusion-based skill prior and an action decoder with keypose imagination, supporting both long-horizon planning and fine-grained control. A contrastive learning objective aligns skill embeddings with semantic labels and temporal structure, and a keypose-driven chunking mechanism chains skills robustly at inference time. Experiments on RLBench and real-world multi-task manipulation show consistent improvements over state-of-the-art baselines, indicating better cross-task generalization and more robust long-horizon execution.
Abstract
While imitation learning has shown impressive results in single-task robot manipulation, scaling it to multi-task settings remains a fundamental challenge due to issues such as suboptimal demonstrations, trajectory noise, and behavioral multi-modality. Existing skill-based methods attempt to address this by decomposing actions into reusable abstractions, but they often rely on fixed-length segmentation or environmental priors that limit semantic consistency and cross-task generalization. In this work, we propose AtomSkill, a novel multi-task imitation learning framework that learns and leverages a structured Atomic Skill Space for composable robot manipulation. Our approach is built on two key technical contributions. First, we construct a Semantically Grounded Atomic Skill Library by partitioning demonstrations into variable-length skills using gripper-state keyframe detection and vision-language model annotation. A contrastive learning objective ensures the resulting skill embeddings are both semantically consistent and temporally coherent. Second, we propose an Action Generation module with Keypose Imagination, which jointly predicts a skill's long-horizon terminal keypose and its immediate action sequence. This enables the policy to reason about overarching motion goals and fine-grained control simultaneously, facilitating robust skill chaining. Extensive experiments in simulated and real-world environments show that AtomSkill consistently outperforms state-of-the-art methods across diverse manipulation tasks.
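The variable-length partitioning described above can be illustrated with a minimal sketch. It assumes a binary per-timestep gripper signal (1 = open, 0 = closed) and treats every open/close transition as a keyframe. The function names `gripper_keyframes` and `segment_demo` are hypothetical; the abstract does not specify the exact detection rule, and the vision-language model annotation step is not shown.

```python
import numpy as np

def gripper_keyframes(gripper_state):
    """Return indices t where the gripper state differs from step t-1.

    Assumption: gripper_state is a binary sequence (1 = open, 0 = closed).
    """
    g = np.asarray(gripper_state).astype(int)
    # np.diff marks a change between t-1 and t at position t-1, so shift by 1.
    return (np.flatnonzero(np.diff(g) != 0) + 1).tolist()

def segment_demo(gripper_state):
    """Split a demonstration into variable-length [start, end) segments
    whose boundaries are the gripper-state keyframes."""
    n = len(gripper_state)
    bounds = [0] + gripper_keyframes(gripper_state) + [n]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

# Toy demonstration: approach (open), grasp and carry (closed), release (open).
demo = [1, 1, 1, 0, 0, 0, 0, 1, 1]
segments = segment_demo(demo)
print(segments)  # [(0, 3), (3, 7), (7, 9)]
```

Each resulting segment has a different length, which is the property that distinguishes this kind of segmentation from fixed-length chunking.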
