General-purpose Clothes Manipulation with Semantic Keypoints

Yuhong Deng; David Hsu

TL;DR

This work tackles general-purpose clothes manipulation by introducing CLASP, a hierarchical framework that uses semantic keypoints (language descriptors plus 2-D positions) to represent clothing state. Semantic keypoints are detected with a masked autoencoder-based spatiotemporal model trained with reconstruction loss $L_r$ and keypoint loss $L_{kp}$, while an LLM provides task planning by decomposing language instructions into sub-tasks described by action primitives and keypoint-based contacts. A low-level action primitives library then generates trajectories grounded at semantic keypoints to execute the sub-tasks. Experiments in SoftGym (across 30 tasks and four clothing categories) and real Kinova dual-arm trials demonstrate superior generalization to unseen tasks and successful sim-to-real transfer, highlighting the effectiveness of semantic keypoints as a general-purpose cue for deformable object manipulation.
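The hand-off between the two levels described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the class names `SemanticKeypoint` and `SubTask`, the `fold` primitive, the `ground` helper, and all pixel coordinates are hypothetical, and the hard-coded `plan` stands in for what an LLM planner would return.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int]
Waypoint = Tuple[str, Pixel]


@dataclass(frozen=True)
class SemanticKeypoint:
    """A language descriptor paired with a 2-D image position."""
    descriptor: str  # e.g. "left sleeve"
    position: Pixel  # (u, v) pixel coordinates in the depth image


@dataclass(frozen=True)
class SubTask:
    """One planner step: an action primitive plus the keypoints it contacts."""
    primitive: str
    contacts: Tuple[str, ...]  # keypoint descriptors, not coordinates


def fold(contacts: List[Pixel]) -> List[Waypoint]:
    # Hypothetical primitive: pick at the first contact, place at the second.
    pick, place = contacts
    return [("pick", pick), ("place", place)]


# Assumed primitives library: maps a primitive name to a trajectory generator.
PRIMITIVES: Dict[str, Callable[[List[Pixel]], List[Waypoint]]] = {"fold": fold}


def ground(plan: List[SubTask], keypoints: List[SemanticKeypoint]) -> List[Waypoint]:
    """Resolve each sub-task's descriptors to positions and expand it into waypoints."""
    lookup = {kp.descriptor: kp.position for kp in keypoints}
    trajectory: List[Waypoint] = []
    for sub_task in plan:
        positions = [lookup[name] for name in sub_task.contacts]
        trajectory.extend(PRIMITIVES[sub_task.primitive](positions))
    return trajectory


# Made-up detections standing in for the keypoint detector's output.
keypoints = [
    SemanticKeypoint("left sleeve", (40, 120)),
    SemanticKeypoint("right sleeve", (200, 120)),
    SemanticKeypoint("left shoulder", (80, 60)),
    SemanticKeypoint("right shoulder", (160, 60)),
]

# Made-up plan standing in for the LLM's decomposition of a folding instruction.
plan = [
    SubTask("fold", ("left sleeve", "right shoulder")),
    SubTask("fold", ("right sleeve", "left shoulder")),
]

trajectory = ground(plan, keypoints)
```

The point of the sketch is that the planner only ever sees descriptors, while the primitives only ever see positions; the keypoint list is the single interface between them.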

Abstract

Clothes manipulation is a critical capability for household robots, yet existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder" or "left sleeve" -- a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothing types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.

Paper Structure (14 sections, 5 figures, 2 tables)

This paper contains 14 sections, 5 figures, 2 tables.

Introduction
Related Work
  Deformable Object Manipulation
  Language-conditioned Object Manipulation
  State Representation of Deformable Object
Method
  Semantic Keypoint Detection
  Task Planning
  Action Execution
Experiments
  Semantic Keypoint Detection Experiments
  Simulation Experiments
  Real-Robot Experiments
Conclusion

Figures (5)

Figure 1: General-purpose clothes manipulation. CLASP performs various manipulation tasks over different types of clothes.
Figure 2: An overview of CLASP. Given the natural-language task instruction and a depth image, CLASP first detects semantic keypoints, each consisting of a language descriptor and a 2-D geometric position. The task instruction and the language descriptors are fed into an LLM to generate a sequence of sub-tasks on the keypoints. For each sub-task, a low-level action primitives library generates the action on the keypoint position.
Figure 3: An ablation study on the effects of masking and temporal information on semantic keypoint detection.
Figure 4: Setup for real-robot experiments. The dual-arm system consists of an Intel RealSense camera for depth sensing and two Kinova Mico arms.
Figure 5: Real-robot experiments. (a) Four clothes manipulation tasks: folding, flattening, hanging, and placing. (b) Semantic keypoint detection on a variety of clothes.
