MuST: Multi-Head Skill Transformer for Long-Horizon Dexterous Manipulation with Skill Progress
Kai Gao, Fan Wang, Erica Aduh, Dylan Randle, Jane Shi
TL;DR
MuST addresses long-horizon dexterous manipulation by decomposing tasks into a finite skill set S and learning per-skill policies on a shared transformer backbone. Each skill i outputs actions P_t^{(i)} and a progress value ρ_t^{(i)} ∈ [0,1], with a termination condition θ_i guiding transitions; a progress-guided selector, ProGSS, chooses the next skill based on these progress signals. The architecture allows multiple skill heads to be trained simultaneously, and the skill set can be extended by adding new heads and updating the progress head. Empirical results in simulation and on real robots show substantial improvements over the Octo single-policy baseline: task completion rises from 32.5% to about 90%, execution time falls by up to roughly 38%, and performance remains robust to disturbances and diverse object sets.
Abstract
Robot picking and packing tasks require dexterous manipulation skills, such as rearranging objects to establish a good grasping pose, or placing and pushing items to achieve tight packing. These tasks are challenging for robots due to the complexity and variability of the required actions. To tackle the difficulty of learning and executing long-horizon tasks, we propose a novel framework called the Multi-Head Skill Transformer (MuST). This model is designed to learn and sequentially chain together multiple motion primitives (skills), enabling robots to perform complex sequences of actions effectively. MuST introduces a "progress value" for each skill, guiding the robot on which skill to execute next and ensuring smooth transitions between skills. Additionally, our model is capable of expanding its skill set and managing various sequences of sub-tasks efficiently. Extensive experiments in both simulated and real-world environments demonstrate that MuST significantly enhances the robot's ability to perform long-horizon dexterous manipulation tasks.
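
As a rough illustration of how per-skill progress values could determine which skill runs next, the Python sketch below assumes a fixed skill order and treats each termination condition θ_i as a scalar threshold on the progress value ρ^{(i)}. Both assumptions, and the function name select_skill, are hypothetical and not taken from the MuST implementation; the rule shown is a simple first-unfinished-skill stand-in, not ProGSS itself.

```python
from typing import Optional, Sequence


def select_skill(
    progress: Sequence[float],
    thresholds: Sequence[float],
) -> Optional[int]:
    """Pick the next skill to execute from per-skill progress values.

    Assumption (not from the paper): skills are listed in execution order,
    and skill i counts as finished once progress[i] >= thresholds[i].
    Returns the index of the first unfinished skill, or None if all are done.
    """
    if len(progress) != len(thresholds):
        raise ValueError("progress and thresholds must have the same length")

    for i, (rho, theta) in enumerate(zip(progress, thresholds)):
        # Progress values are defined on the closed interval [0, 1].
        if not 0.0 <= rho <= 1.0:
            raise ValueError(f"progress for skill {i} is outside [0, 1]: {rho}")
        if rho < theta:
            return i  # first skill that has not yet reached its threshold

    return None  # every skill has reached its termination threshold


# Example: skill 0 is complete, skill 1 is partway through, skill 2 not started.
next_skill = select_skill([1.0, 0.4, 0.0], [0.9, 0.9, 0.9])
```

Because the selector reads only the current progress values, a disturbance that undoes an earlier skill (its progress dropping back below the threshold) would cause that skill to be selected again under this rule.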
