Learning Keypoints for Robotic Cloth Manipulation using Synthetic Data

Thomas Lips, Victor-Louis De Gusseme, Francis wyffels

TL;DR

This work tackles the generalization gap in robotic cloth manipulation with a synthetic data pipeline that trains semantic keypoint detectors for almost-flattened clothes, validated on the real-world aRTF dataset. It systematically explores procedural cloth mesh generation, random materials, and NVIDIA Flex-based deformations to produce diverse training images. Detectors trained on synthetic data alone reach a peak mean average precision (mAP) of 64.3% and an average keypoint distance (AKD) of 18 px; fine-tuning on real data improves this to 74.2% mAP and 8.7 px AKD. A comparative study shows that single-layer meshes with random materials give the best synthetic training results, while a persistent sim-to-real gap remains that likely requires higher-fidelity cloth assets (e.g., seams, UV maps) to close. The work provides a practical pipeline and dataset for cloth folding research and outlines improved realism and interactive perception as directions for future work.
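
The AKD figure reported above can be read as the mean pixel distance between each ground-truth keypoint and its predicted counterpart. The following is a minimal sketch of that idea, assuming exactly one prediction per ground-truth keypoint, listed in matching order; the paper's exact matching protocol may differ. The function name and the coordinates are hypothetical and are not taken from the paper or its code.

```python
import math


def average_keypoint_distance(pred_xy, gt_xy):
    """Mean Euclidean distance (in pixels) between matched keypoints.

    Assumption: pred_xy[i] is the prediction for gt_xy[i]. This pairing
    rule is a simplification, not the paper's documented protocol.
    """
    if len(pred_xy) != len(gt_xy) or not gt_xy:
        raise ValueError("expected two non-empty keypoint lists of equal length")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_xy, gt_xy)
    ]
    return sum(distances) / len(distances)


# Hypothetical towel corners in image coordinates (not real data).
gt = [(100, 100), (200, 100), (200, 250), (100, 250)]
pred = [(103, 104), (200, 110), (194, 242), (100, 250)]

akd = average_keypoint_distance(pred, gt)
print(f"AKD: {akd:.2f} px")  # per-keypoint distances are 5, 10, 10 and 0 px
```

Because the value is expressed in pixels, it depends on image resolution, so AKD numbers are only comparable between detectors evaluated on images of the same size.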

Abstract

Assistive robots should be able to wash, fold or iron clothes. However, due to the variety, deformability and self-occlusions of clothes, creating robot systems for cloth manipulation is challenging. Synthetic data is a promising direction to improve generalization, but the sim-to-real gap limits its effectiveness. To advance the use of synthetic data for cloth manipulation tasks such as robotic folding, we present a synthetic data pipeline to train keypoint detectors for almost-flattened cloth items. To evaluate its performance, we have also collected a real-world dataset. We train detectors for T-shirts, towels and shorts and obtain a mean average precision (mAP) of 64% and an average keypoint distance of 18 pixels. Fine-tuning on real-world data improves performance to 74% mAP and an average keypoint distance of only 9 pixels. Furthermore, we describe failure modes of the keypoint detectors and compare different approaches to obtain cloth meshes and materials. We also quantify the remaining sim-to-real gap and argue that improving the fidelity of cloth assets will be required to reduce this gap further. The code, dataset and trained models are available

Paper Structure

This paper contains 24 sections, 5 figures and 5 tables.

Figures (5)

  • Figure 1: In this work we learn to detect semantic keypoints on almost-flattened clothes in everyday environments. To tackle the large diversity in cloth states, cloth materials and environments, we generate synthetic data to train these keypoint detectors.
  • Figure 2: Examples of the generated synthetic images. Though these single-layer meshes with random materials look unrealistic, they have the best performance out of all evaluated procedures.
  • Figure 3: Illustration of failure modes of the detectors. From left to right, the columns show ground truth keypoints and all predicted keypoints of detectors trained on real data, synthetic data and on both. The first row illustrates how the real baseline produces incomplete and inconsistent keypoints, whereas the second row shows how the detectors still struggle with folds. In the third row, the detectors are confused by an open zipper and mistake it for a leg of the shorts. The keypoint colors encode their category for each cloth type.
  • Figure 4: Examples of the considered cloth mesh procedures. Left to right: undeformed single-layer mesh, Cloth3d mesh, single-layer mesh. Though less realistic, using the single-layer meshes results in the best performance.
  • Figure 5: Examples of the different cloth materials that were considered. From left to right: uniform colors, a cloth-tailored procedural material and random PolyHaven textures. Though random materials are less plausible, they result in the best performance.