CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints

Yuhong Deng, Chao Tang, Cunjun Yu, Linfeng Li, David Hsu

TL;DR

This work tackles general-purpose manipulation of deformable clothes by introducing semantic keypoints as a sparse, language-describable state representation. CLASP integrates a two-stage semantic keypoint extraction/matching pipeline with a vision-language-model–driven task planner and a pre-built skill library to execute grounded actions, all within a closed-loop framework. Across simulation and real-robot experiments, CLASP demonstrates superior generalization across diverse clothes types and tasks, outperforming baselines and achieving high success in folding, flattening, hanging, and placing. The approach promises practical impact for home-service robotics by enabling robust, adaptable manipulation of varied garments without extensive task-specific engineering.

Abstract

Clothes manipulation, such as folding or hanging, is a critical capability for home service robots. Despite recent advances, most existing methods remain limited to specific clothes types and tasks, due to the complex, high-dimensional geometry of clothes. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation over diverse clothes types (T-shirts, shorts, skirts, long dresses, ...) as well as different tasks (folding, flattening, hanging, ...). The core idea of CLASP is semantic keypoints, e.g., "left sleeve" and "right shoulder": a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be reliably extracted from RGB-D images and provide an effective representation for a wide range of clothes manipulation policies. CLASP uses semantic keypoints as an intermediate representation to connect high-level task planning and low-level action execution. At the high level, it exploits vision language models (VLMs) to predict task plans over the semantic keypoints. At the low level, it executes the plans with the help of a set of pre-built manipulation skills conditioned on the keypoints. Extensive simulation experiments show that CLASP outperforms state-of-the-art baseline methods on multiple tasks across diverse clothes types, demonstrating strong performance and generalization. Further experiments with a Franka dual-arm system on four distinct tasks (folding, flattening, hanging, and placing) confirm CLASP's performance on real-life clothes manipulation.
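
To make the idea of an intermediate representation concrete, the sketch below shows one way a plan expressed over named keypoints could be grounded into 3D targets for keypoint-conditioned skills. It is an illustrative assumption, not the authors' implementation: the five skill names are taken from the Figure 5 caption, while `SemanticKeypoint`, `Step`, `ground_plan`, and the coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Skill names as listed in the Figure 5 caption; everything else is hypothetical.
SKILLS = {"grasp", "moveto", "release", "rotate", "pull"}

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class SemanticKeypoint:
    name: str          # language label, e.g. "left sleeve"
    position: Point3D  # 3D point recovered from RGB-D (frame is an assumption)


@dataclass(frozen=True)
class Step:
    skill: str     # one of SKILLS
    keypoint: str  # name of the keypoint the skill is conditioned on


def ground_plan(plan: List[Step],
                keypoints: List[SemanticKeypoint]) -> List[Tuple[str, Point3D]]:
    """Resolve each step's keypoint name to a 3D target, validating names."""
    by_name: Dict[str, SemanticKeypoint] = {kp.name: kp for kp in keypoints}
    grounded = []
    for step in plan:
        if step.skill not in SKILLS:
            raise ValueError(f"unknown skill: {step.skill}")
        if step.keypoint not in by_name:
            raise KeyError(f"keypoint not detected: {step.keypoint}")
        grounded.append((step.skill, by_name[step.keypoint].position))
    return grounded


# Made-up keypoints and a made-up three-step plan, for illustration only.
keypoints = [
    SemanticKeypoint("left sleeve", (0.10, 0.30, 0.02)),
    SemanticKeypoint("right shoulder", (0.35, 0.20, 0.02)),
]
plan = [
    Step("grasp", "left sleeve"),
    Step("moveto", "right shoulder"),
    Step("release", "right shoulder"),
]
grounded = ground_plan(plan, keypoints)
```

Because the plan refers to keypoints by name, the same plan text can in principle be reused on a different garment once its keypoints are detected; only the grounded positions change.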

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: CLothes mAnipulation with Semantic keyPoints (CLASP). The semantic keypoint representation enables CLASP to generalize over many different types of clothes and tasks. (a) Semantic keypoints for various types of clothes. (b) Four distinct clothes manipulation tasks in our experiments.
  • Figure 2: CLASP Overview. Given an RGB-D observation, CLASP extracts semantic keypoints. These keypoints, along with the RGB image and task instruction, are fed to a VLM to generate a task plan. Once the plan is verified for feasibility through motion planning, the subtasks are executed sequentially. After each subtask execution, CLASP updates the observation and decides whether to replan. This process repeats until the overall task is completed.
  • Figure 3: Semantic keypoint discovery. A fully-automated pipeline discovers semantic keypoints on a prototype image for each clothes type.
  • Figure 4: Semantic keypoint matching. Given an image of novel clothes, we first retrieve the most relevant prototype. Each semantic keypoint on the prototype is then matched to the novel clothes through a coarse-to-fine pipeline. Specifically, a VLM is employed for coarse region matching, while Stable Diffusion (SD) and DINOv2 are utilized for fine-grained keypoint matching.
  • Figure 5: Skill library. The CLASP skill library consists of five basic skills: grasp, moveto, release, rotate, and pull.
  • ...and 4 more figures
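
The closed loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's code: every callable passed to `run_closed_loop` is a hypothetical stand-in (for perception, the VLM planner, the motion-planning feasibility check, and skill execution), and requesting a fresh plan when the current one is infeasible is an assumption, since the caption does not say how that case is handled.

```python
def run_closed_loop(observe, extract_keypoints, plan_with_vlm, is_feasible,
                    execute, needs_replan, task_done, instruction,
                    max_rounds=10):
    """Observe, plan, verify, execute, and replan until the task is done.

    All callables are hypothetical stand-ins supplied by the caller.
    Returns (success, list of executed subtasks).
    """
    executed = []
    for _ in range(max_rounds):
        obs = observe()
        keypoints = extract_keypoints(obs)
        if task_done(obs, keypoints, instruction):
            return True, executed
        plan = plan_with_vlm(obs, keypoints, instruction)
        if not is_feasible(plan, keypoints):
            continue  # assumption: ask for a new plan in the next round
        for subtask in plan:
            execute(subtask, keypoints)
            executed.append(subtask)
            # Update the observation after each subtask, as in the caption.
            obs = observe()
            keypoints = extract_keypoints(obs)
            if task_done(obs, keypoints, instruction):
                return True, executed
            if needs_replan(obs, keypoints, subtask):
                break  # drop the rest of the plan and plan again
    return False, executed


# Toy stand-ins: the "world" counts completed folds; values are made up.
state = {"folds": 0}


def execute_stub(subtask, keypoints):
    # Treat a release as completing one fold in this toy world.
    if subtask[0] == "release":
        state["folds"] += 1


ok, log = run_closed_loop(
    observe=lambda: dict(state),
    extract_keypoints=lambda obs: {"left sleeve": (0.1, 0.3),
                                   "right sleeve": (0.4, 0.3)},
    plan_with_vlm=lambda obs, kps, instr: [("grasp", "left sleeve"),
                                           ("moveto", "right sleeve"),
                                           ("release", "right sleeve")],
    is_feasible=lambda plan, kps: all(name in kps for _, name in plan),
    execute=execute_stub,
    needs_replan=lambda obs, kps, subtask: False,
    task_done=lambda obs, kps, instr: obs["folds"] >= 1,
    instruction="fold the T-shirt",
)
```

Passing the components in as callables keeps the loop itself independent of any particular perception model, VLM, or robot interface, which is convenient for testing the control flow with stubs like the ones above.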