MuST: Multi-Head Skill Transformer for Long-Horizon Dexterous Manipulation with Skill Progress

Kai Gao, Fan Wang, Erica Aduh, Dylan Randle, Jane Shi

TL;DR

MuST addresses long-horizon dexterous manipulation by decomposing tasks into a finite skill set S and learning per-skill policies on a shared transformer backbone. Each skill i outputs actions P_t^{(i)} and a progress value ρ_t^{(i)} ∈ [0,1], with a termination condition θ_i guiding transitions; a progress-guided skill selector, ProGSS, chooses the next skill based on these progress signals. The architecture supports simultaneous training of multiple skill heads, and the skill set can be expanded by adding new heads and updating the progress head. Empirical results in simulation and on real robots show substantial improvements over the Octo single-policy baseline: task completion rises from 32.5% to about 90%, and execution time drops by up to ~38%. The results also show robustness to disturbances and to diverse object sets.
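To make the role of the progress values concrete, the following is a minimal Python sketch of a progress-guided selection step. It is not the paper's ProGSS algorithm, whose exact rule is not given in this summary. The function name `select_skill`, the reading of θ_i as a scalar threshold on ρ_t^{(i)}, and the rule "switch to the unfinished skill with the largest progress" are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def select_skill(
    progress: Sequence[float],
    thresholds: Sequence[float],
    current: Optional[int] = None,
) -> Optional[int]:
    """Pick the index of the skill to execute next.

    ASSUMPTION (illustrative, not from the paper): each termination
    condition is modeled as a scalar threshold on the skill's progress,
    and the next skill is the unfinished one with the largest progress.
    """
    if len(progress) != len(thresholds):
        raise ValueError("progress and thresholds must have the same length")

    # Keep executing the current skill until its progress reaches its threshold.
    if current is not None and progress[current] < thresholds[current]:
        return current

    # Otherwise consider every skill whose progress is still below its threshold.
    candidates = [
        i for i, (p, th) in enumerate(zip(progress, thresholds)) if p < th
    ]
    if not candidates:
        return None  # every skill reports completion

    # Hypothetical tie-breaking: max() returns the first index among equals.
    return max(candidates, key=lambda i: progress[i])


if __name__ == "__main__":
    thresholds = [0.95, 0.95, 0.95]
    # Skill 0 is still running, so it is kept.
    print(select_skill([0.20, 0.00, 0.00], thresholds, current=0))
    # Skill 0 has finished, so the selector moves to another skill.
    print(select_skill([0.97, 0.40, 0.10], thresholds, current=0))
```

In this sketch the selector is stateless apart from the `current` index, so it can be called once per control step with the latest progress-head output.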

Abstract

Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a "progress value" for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks.

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: [Top] An example of long-horizon dexterous manipulation. The robot executes four skills to manipulate an object from the boundary of the picking tote to the corner of the packing tote. [Bottom] Our proposed imitation learning model MuST with $N$ skills and the skill selector ProGSS.
  • Figure 2: Overview of MuST (Multi-Head Skill Transformer). The model consists of a pre-trained Octo transformer backbone [team2024octo] and $N+1$ heads for an $N$-skill set. Each skill head computes an action sequence for its skill. The progress head ProGSS, which serves as the skill selector, estimates the progress of the entire skill set.
  • Figure 3: Annotation of skill progress in a skill-related episode segment.
  • Figure 4: An example of ProGSS with a single skill sequence.
  • Figure 5: We use either language prompts or images as goal state indicators to customize packing poses.
  • ...and 5 more figures