General-purpose Clothes Manipulation with Semantic Keypoints
Yuhong Deng, David Hsu
TL;DR
This work tackles general-purpose clothes manipulation by introducing CLASP, a hierarchical framework that uses semantic keypoints (language descriptors plus 2-D positions) to represent clothing state. Semantic keypoints are detected with a masked autoencoder-based spatiotemporal model trained with a reconstruction loss $L_r$ and a keypoint loss $L_{kp}$. An LLM handles task planning by decomposing language instructions into sub-tasks described by action primitives and keypoint-based contacts. A low-level library of action primitives then generates trajectories grounded at semantic keypoints to execute the sub-tasks. Experiments in SoftGym (30 tasks across four clothing categories) and real-world trials on a Kinova dual-arm system show better generalization to unseen tasks than baseline methods and successful sim-to-real transfer, highlighting the effectiveness of semantic keypoints as a general-purpose cue for deformable object manipulation.
Abstract
Clothes manipulation is a critical capability for household robots, yet existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables a robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder" and "left sleeve" -- a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
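To make the two-level hierarchy concrete, the following is a minimal Python sketch of how a semantic keypoint (a language descriptor plus a 2-D position) might link a planner's sub-task to a grounded contact location. It is an illustration under stated assumptions, not the authors' implementation: the names `SemanticKeypoint`, `SubTask`, and `ground`, the `"fold"` primitive label, and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SemanticKeypoint:
    """A language descriptor paired with a 2-D image position (pixels)."""
    name: str
    xy: Tuple[float, float]


@dataclass(frozen=True)
class SubTask:
    """One planner step: an action primitive plus the keypoints it contacts."""
    primitive: str               # hypothetical primitive label
    contacts: Tuple[str, ...]    # keypoint names, in the order the primitive uses them


def ground(subtask: SubTask,
           detected: List[SemanticKeypoint]) -> List[Tuple[float, float]]:
    """Resolve the keypoint names in a sub-task to detected 2-D positions."""
    lookup: Dict[str, Tuple[float, float]] = {kp.name: kp.xy for kp in detected}
    missing = [name for name in subtask.contacts if name not in lookup]
    if missing:
        raise KeyError(f"keypoints not detected: {missing}")
    return [lookup[name] for name in subtask.contacts]


# Hypothetical detector output (made-up coordinates, for illustration only).
detected = [
    SemanticKeypoint("left sleeve", (42.0, 110.0)),
    SemanticKeypoint("right sleeve", (214.0, 108.0)),
    SemanticKeypoint("left shoulder", (88.0, 60.0)),
    SemanticKeypoint("right shoulder", (170.0, 61.0)),
]

# Hypothetical planner output for an instruction such as "fold the sleeves in".
plan = [
    SubTask("fold", ("left sleeve", "right shoulder")),
    SubTask("fold", ("right sleeve", "left shoulder")),
]

# Each sub-task becomes a list of 2-D contact positions for a low-level primitive.
grounded = [ground(step, detected) for step in plan]
```

In this sketch the planner only ever refers to keypoints by name, and the low-level side only ever sees positions, so the keypoint names are the sole interface between the two levels. Raising an error when a named keypoint was not detected is one possible design choice; a real system might instead trigger re-planning or re-detection.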
