CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints

Yuhong Deng; Chao Tang; Cunjun Yu; Linfeng Li; David Hsu

TL;DR

This work tackles general-purpose manipulation of deformable clothes by introducing semantic keypoints as a sparse, language-describable state representation. CLASP integrates a two-stage semantic keypoint extraction/matching pipeline with a vision-language-model–driven task planner and a pre-built skill library to execute grounded actions, all within a closed-loop framework. Across simulation and real-robot experiments, CLASP demonstrates superior generalization across diverse clothes types and tasks, outperforming baselines and achieving high success in folding, flattening, hanging, and placing. The approach promises practical impact for home-service robotics by enabling robust, adaptable manipulation of varied garments without extensive task-specific engineering.

Abstract

Clothes manipulation, such as folding or hanging, is a critical capability for home service robots. Despite recent advances, most existing methods remain limited to specific clothes types and tasks, due to the complex, high-dimensional geometry of clothes. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation over diverse clothes types (T-shirts, shorts, skirts, long dresses, and more) as well as different tasks (folding, flattening, hanging, and more). The core idea of CLASP is semantic keypoints, e.g., "left sleeve" and "right shoulder": a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be reliably extracted from RGB-D images and provide an effective representation for a wide range of clothes manipulation policies. CLASP uses semantic keypoints as an intermediate representation to connect high-level task planning and low-level action execution. At the high level, it exploits vision language models (VLMs) to predict task plans over the semantic keypoints. At the low level, it executes the plans with the help of a set of pre-built manipulation skills conditioned on the keypoints. Extensive simulation experiments show that CLASP outperforms state-of-the-art baseline methods on multiple tasks across diverse clothes types, demonstrating strong performance and generalization. Further experiments with a Franka dual-arm system on four distinct tasks (folding, flattening, hanging, and placing) confirm CLASP's performance on real-life clothes manipulation.
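
To make the role of semantic keypoints as an intermediate representation more concrete, here is a minimal Python sketch of how a symbolic plan over keypoint names might be grounded into low-level actions. It is an illustration under stated assumptions, not the authors' implementation: the names `Step`, `ground_plan`, `pick_and_place`, and `SKILLS`, the coordinate values, and the plan itself are all hypothetical. In CLASP the keypoints would come from RGB-D perception and the plan from a VLM; both are hard-coded here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]  # hypothetical 3D keypoint position (x, y, z)


@dataclass(frozen=True)
class Step:
    """One symbolic plan step: a skill name plus the keypoint names it refers to."""
    skill: str
    keypoints: Tuple[str, ...]


def pick_and_place(pick: Point, place: Point) -> dict:
    """Hypothetical skill: returns an action description instead of moving a robot."""
    return {"type": "pick_and_place", "pick": pick, "place": place}


# Hypothetical skill library: maps a skill name to a function of keypoint positions.
SKILLS: Dict[str, Callable[..., dict]] = {"pick_and_place": pick_and_place}


def ground_plan(
    plan: List[Step],
    keypoints: Dict[str, Point],
    skills: Dict[str, Callable[..., dict]],
) -> List[dict]:
    """Resolve each symbolic step into a concrete action using detected keypoints."""
    actions = []
    for step in plan:
        if step.skill not in skills:
            raise KeyError(f"unknown skill: {step.skill}")
        missing = [name for name in step.keypoints if name not in keypoints]
        if missing:
            raise KeyError(f"keypoints not detected: {missing}")
        positions = [keypoints[name] for name in step.keypoints]
        actions.append(skills[step.skill](*positions))
    return actions


# Made-up keypoint positions standing in for the perception output.
detected = {
    "left sleeve": (0.10, 0.40, 0.00),
    "right shoulder": (0.50, 0.60, 0.00),
}
# Made-up one-step plan standing in for the VLM planner output.
plan = [Step("pick_and_place", ("left sleeve", "right shoulder"))]

actions = ground_plan(plan, detected, SKILLS)
print(actions)
```

The point of the sketch is the separation of concerns: the plan refers only to keypoint names, so the same plan structure can apply to any garment for which those keypoints are detected, while the skill functions see only positions.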
