Flat'n'Fold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation

Lipeng Zhuang, Shiyu Fan, Yingdong Ru, Florent Audonnet, Paul Henderson, Gerardo Aragon-Camarasa

TL;DR

This work presents Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets, and establishes new benchmarks for grasping point prediction and subtask decomposition.

Abstract

We present Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat'n'Fold surpasses prior datasets in size, scope, and diversity. Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations. We quantify the dataset's diversity and complexity against existing benchmarks and show that its real-world human and robot demonstrations are natural and diverse in both visual and action information. To showcase Flat'n'Fold's utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement, which underscores Flat'n'Fold's potential to drive advances in robotic perception and manipulation of deformable objects. Our dataset can be downloaded at https://cvas-ug.github.io/flat-n-fold

Paper Structure

This paper contains 15 sections, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: Human and robot demonstrations: Each demonstrates the progression from a crumpled garment configuration to a flattened and folded state.
  • Figure 2: Hardware setup. (1) Front camera; (2) Top camera; (3) Side camera; (4) Valve Index VR headset [valveindex], which serves as the origin of the world coordinate frame; (5) HTC Vive tracker; (6) Receiver of the tracker; (7) Pedal; (8) Grasping point, where the yellow line indicates the distance from the center of the tracker to the grasping point; (9) Baxter's gripper in (A) its closed state and (B) its opened state; (10) Baxter's zero-G mode [armcontrolsystem] and control buttons. Black numbers mark hardware used for both human and robot demonstrations, red numbers mark hardware used only for robot demonstrations, and green numbers mark hardware used only for human demonstrations.
  • Figure 3: Example of subtask decomposition on Flat'n'Fold using UVD [zhang2024universal], compared against the ground truth. Green blocks represent subtasks accurately predicted by UVD, while red blocks indicate ground-truth subtask images that UVD failed to identify.
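
Item (8) of Figure 2 notes that the grasping point sits at a distance from the tracker center. The sketch below illustrates one way such an offset could be applied to a recorded tracker pose. It is not the authors' procedure: it assumes the offset is a fixed vector in the tracker's own frame and that rotations are unit quaternions in (w, x, y, z) order. The function names and the offset value are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed order (w, x, y, z), unit length


def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate vector v by the unit quaternion q."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + (q_vec x t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def grasp_point(tracker_pos: Vec3, tracker_rot: Quat, offset: Vec3) -> Vec3:
    """World-frame grasping point: tracker position plus the rotated offset.

    `offset` is a hypothetical tracker-frame vector from the tracker center
    to the grasping point; its real value depends on how the tracker is mounted.
    """
    ox, oy, oz = rotate(tracker_rot, offset)
    return (tracker_pos[0] + ox, tracker_pos[1] + oy, tracker_pos[2] + oz)


# Usage: tracker at (1, 2, 3), rotated 90 degrees about the z-axis,
# with an illustrative 0.1 m offset along the tracker's x-axis.
half = math.radians(90.0) / 2.0
q_z90 = (math.cos(half), 0.0, 0.0, math.sin(half))
point = grasp_point((1.0, 2.0, 3.0), q_z90, (0.1, 0.0, 0.0))
# point is approximately (1.0, 2.1, 3.0)
```

With this rotation the tracker's x-axis maps onto the world y-axis, so the offset moves the point along y rather than x.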