Learning Semantic Atomic Skills for Multi-Task Robotic Manipulation

Yihang Zhu, Weiqing Wang, Shijie Wu, Ye Shi, Jingya Wang

TL;DR

AtomSkill introduces a semantic atomic-skill framework for multi-task robotic manipulation by segmenting demonstrations into variable-length, semantically coherent skills annotated with a vision-language model. It couples a VQ-VAE-style latent skill space with a diffusion-based skill prior and a keypose-imagination-enabled action decoder, enabling both long-horizon planning and fine-grained control. The method uses contrastive learning to align semantic labels and temporal structure, and a keypose-driven chunking mechanism to robustly chain skills during inference. Empirical results on RLBench and real-world multi-task manipulation show consistent improvements over state-of-the-art baselines, highlighting improved cross-task generalization and robust long-horizon execution.
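
To make the "VQ-VAE-style latent skill space" concrete, the sketch below shows the generic nearest-codebook lookup used in vector-quantized models: a continuous skill embedding is snapped to its closest codebook entry. This is not AtomSkill's implementation. The function name `quantize`, the two-dimensional embeddings, and the three-entry codebook are all hypothetical, and the summary does not state the paper's encoder, codebook size, or training losses.

```python
from typing import List, Sequence, Tuple


def quantize(
    z: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and a copy of the codebook entry nearest to z.

    Distance is squared Euclidean, as in a standard VQ-VAE lookup.
    """
    if not codebook:
        raise ValueError("codebook must not be empty")
    best_index = min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
    )
    return best_index, list(codebook[best_index])


# Hypothetical toy codebook with three 2-D "skill" codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# A continuous embedding close to the second code.
index, z_q = quantize([0.9, 1.2], codebook)
```

Here `index` is the discrete skill code and `z_q` is the quantized embedding that a downstream decoder would condition on. A learned model would also need a training signal for the codebook (for example a commitment loss), which this sketch omits.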

Abstract

While imitation learning has shown impressive results in single-task robot manipulation, scaling it to multi-task settings remains a fundamental challenge due to issues such as suboptimal demonstrations, trajectory noise, and behavioral multi-modality. Existing skill-based methods attempt to address this by decomposing actions into reusable abstractions, but they often rely on fixed-length segmentation or environmental priors that limit semantic consistency and cross-task generalization. In this work, we propose AtomSkill, a novel multi-task imitation learning framework that learns and leverages a structured Atomic Skill Space for composable robot manipulation. Our approach is built on two key technical contributions. First, we construct a Semantically Grounded Atomic Skill Library by partitioning demonstrations into variable-length skills using gripper-state keyframe detection and vision-language model annotation. A contrastive learning objective ensures the resulting skill embeddings are both semantically consistent and temporally coherent. Second, we propose an Action Generation module with Keypose Imagination, which jointly predicts a skill's long-horizon terminal keypose and its immediate action sequence. This enables the policy to reason about overarching motion goals and fine-grained control simultaneously, facilitating robust skill chaining. Extensive experiments in simulated and real-world environments show that AtomSkill consistently outperforms state-of-the-art methods across diverse manipulation tasks.
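
The variable-length partitioning described above can be illustrated with a minimal sketch. It assumes the simplest possible keyframe rule: a binary open/closed gripper signal, with a skill boundary at every timestep where the state changes. The abstract does not specify the exact detection rule, so the function name `segment_by_gripper` and this rule are assumptions, and the vision-language annotation step is not shown.

```python
from typing import List, Sequence, Tuple


def segment_by_gripper(gripper_open: Sequence[bool]) -> List[Tuple[int, int]]:
    """Split a trajectory into half-open [start, end) segments.

    Assumed rule: a keyframe is any timestep whose gripper state differs
    from the previous timestep; each keyframe starts a new segment.
    """
    n = len(gripper_open)
    if n == 0:
        return []
    keyframes = [t for t in range(1, n) if gripper_open[t] != gripper_open[t - 1]]
    boundaries = [0] + keyframes + [n]
    return list(zip(boundaries[:-1], boundaries[1:]))


# Hypothetical 9-step demonstration: approach (open), carry (closed), retreat (open).
demo = [True, True, True, False, False, True, True, True, True]
segments = segment_by_gripper(demo)
# segments == [(0, 3), (3, 5), (5, 9)]
```

The resulting segments have lengths 3, 2, and 4, in contrast to fixed-length windows, which would cut the trajectory at the same interval regardless of where the gripper acts.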

Paper Structure

This paper contains 24 sections, 15 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Comparison of skill learning strategies between previous skill-based imitation learning methods and our proposed AtomSkill. (a) Prior methods apply fixed-length sliding windows, resulting in overlapping motion fragments and ambiguous skill boundaries. (b) In contrast, AtomSkill segments demonstrations into semantically coherent and temporally aligned skills.
  • Figure 2: Framework of AtomSkill. The left panel illustrates semantic skill discovery: expert demonstrations of the same task are segmented into semantically coherent, temporally aligned clips, and a vision–language model assigns a skill label to each segment. The top-right panel shows skill learning, where AtomSkill structures the skill space and trains both the skill-guided policy and the diffusion-based sampler. The bottom-right panel depicts inference via action chunking with keypose, enabling smooth and robust chaining of predicted skills.
  • Figure 3: Illustration of selected RLBench tasks and real-world tasks. The six RLBench tasks fall into two groups: motion-pattern tasks and spatial-localization tasks. The former test the ability to reproduce consistent motion dynamics, while the latter examine the accuracy of spatial grounding. The three real-world tasks probe both spatial localization and long-horizon action-sequence modeling.
  • Figure 4: The robot setup used in our real-world experiments.
  • Figure 5: t-SNE visualization of latent features of QueST and AtomSkill in the real-world bimanual task setting.
  • ...and 4 more figures