
RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets

Jikai Ye, Wanze Li, Shiraz Khan, Gregory S. Chirikjian

TL;DR

This work presents RaggeDi, a diffusion-based framework for estimating the full state of deformable cloth from a single RGB-D image. It represents the cloth state as a translation map $\boldsymbol{\tau} \in \mathbb{R}^{H\times W\times 3}$ that maps a canonical flattened mesh to its deformed configuration. This lets RaggeDi recast cloth state estimation as conditional image generation, which it solves with a DDPM conditioned on depth. The approach achieves strong accuracy and speed in simulation, transfers zero-shot from simulation to real-world tests, and can optionally be refined with point-cloud registration. The method is intended to support robust, real-time cloth manipulation in robotic tasks such as dressing, folding, and covering. The authors identify more complex topologies and mesh-level diffusion as future work.
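Here is the representation in concrete terms. If the flattened mesh is an $H\times W$ grid of vertices $\boldsymbol{c}_{ij}$, the deformed vertices are $\boldsymbol{c}_{ij} + \boldsymbol{\tau}_{ij}$. The same arrays, read as arrows starting at each canonical vertex, form a translation vector field. The minimal numpy sketch below makes several assumptions that do not come from the paper: a unit-square grid on the $z = 0$ plane, and an 8-bit RGB encoding that clips translations to a fixed `bound`. The grid size, `bound`, and all function names are hypothetical, and this summary does not state the paper's exact normalization.

```python
import numpy as np


def canonical_grid(h, w, width=1.0, height=1.0):
    """Hypothetical flattened mesh: h x w vertices on the z = 0 plane, shape (h, w, 3)."""
    ys, xs = np.meshgrid(np.linspace(0.0, height, h), np.linspace(0.0, width, w), indexing="ij")
    return np.stack([xs, ys, np.zeros_like(xs)], axis=-1)


def translation_map_to_points(tau, grid):
    """Deformed vertices: each canonical vertex plus its translation, flattened to (h*w, 3)."""
    return (grid + tau).reshape(-1, 3)


def tau_to_rgb(tau, bound=1.0):
    """Encode translations clipped to [-bound, bound] as an 8-bit image (x->R, y->G, z->B)."""
    scaled = (np.clip(tau, -bound, bound) + bound) / (2.0 * bound)
    return np.round(scaled * 255.0).astype(np.uint8)


def rgb_to_tau(rgb, bound=1.0):
    """Invert tau_to_rgb up to quantization error of at most bound / 255."""
    return rgb.astype(np.float64) / 255.0 * 2.0 * bound - bound


# Toy state: fold the right half of the cloth over the left half.
h, w = 4, 6
grid = canonical_grid(h, w)
tau = np.zeros((h, w, 3))
right = grid[..., 0] > 0.5
tau[right, 0] = 1.0 - 2.0 * grid[right, 0]  # mirror x about 0.5
tau[right, 2] = 0.02                          # lift the folded layer slightly
points = translation_map_to_points(tau, grid)
image = tau_to_rgb(tau)
```

Because the map lives on a fixed image grid, a standard image diffusion backbone can generate it directly. Converting the result back to a mesh is then a single addition.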

Abstract

Cloth state estimation is an important problem in robotics. To manipulate cloth and execute tasks such as robotic dressing, stitching, and covering or uncovering a person, a robot must know the cloth's state accurately. However, accurate cloth state estimation remains challenging because cloth is highly flexible and prone to self-occlusion. This paper proposes a diffusion-model-based pipeline that formulates cloth state estimation as an image generation problem. The cloth state is represented as an RGB image, the translation map, that describes the point-wise translation between a pre-defined flattened mesh and the deformed mesh in a canonical space. We then train a conditional diffusion-based image generation model to predict the translation map from an observation. Experiments in both simulation and the real world validate the method, and the results indicate that it outperforms two recent methods in both accuracy and speed.
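The conditional model can be illustrated with the standard DDPM recipe. Training regresses the noise added by the forward process, and inference runs the reverse chain from pure noise to a translation map. This numpy sketch makes several assumptions, none of which come from the paper: a linear beta schedule with 100 steps, and a placeholder `predict_noise(x_t, t, depth)` standing in for the noise prediction network of Figure 1(b). `zero_noise_predictor` is a hypothetical stub that only exercises the loop.

```python
import numpy as np


def linear_beta_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_t and running products alpha_bar_t (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def q_sample(x0, t, noise, alpha_bars):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


def training_loss(predict_noise, x0, depth, rng, alpha_bars):
    """Epsilon-prediction MSE for one ground-truth translation map x0 and its depth image."""
    t = int(rng.integers(len(alpha_bars)))
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, noise, alpha_bars)
    return float(np.mean((predict_noise(x_t, t, depth) - noise) ** 2))


def sample_translation_map(predict_noise, depth, shape, num_steps=100, seed=0):
    """Ancestral DDPM sampling of an (H, W, 3) translation map conditioned on depth."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = linear_beta_schedule(num_steps)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t, depth)
        # Mean of p(x_{t-1} | x_t), computed from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Add fresh noise with variance beta_t at every step except the last.
        x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else mean
    return x


def zero_noise_predictor(x_t, t, depth):
    """Hypothetical stand-in for the trained, depth-conditioned network."""
    return np.zeros_like(x_t)


depth = np.zeros((8, 8))  # stand-in for the processed depth image o_d
tau = sample_translation_map(zero_noise_predictor, depth, shape=(8, 8, 3))
```

The loop turns into an actual estimator only when `zero_noise_predictor` is replaced by a network trained with `training_loss` on pairs of depth images and translation maps.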

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: (a) The workflow of the proposed method. (b) The detailed structure of the translation map generation model, which has three main components: an MLP time-step encoder, a ResNet-based observation encoder, and a noise prediction network with a U-Net structure built on a CNN backbone. (c) An example of a mesh predicted by our method.
  • Figure 2: (a) Processed depth image $\boldsymbol{o}_d$ used as the input to the diffusion model. (b) The output translation map $\boldsymbol{\tau}$. (c) A translation vector field that is equivalent to the translation map. (d) Reconstructed predicted point cloud in canonical space using $\boldsymbol{\tau}$.
  • Figure 3: Some examples of data collected in the simulation environment. The first column is the state of the flattened cloth.
  • Figure 4: Examples of cloth states predicted by different methods in the simulation experiments. The four left columns show the mesh, point cloud, translation map, and RGB image rendered from the mesh for a relatively simple cloth state; the four right columns show the same for a complex state.
  • Figure 5: Real-world experiment results. The left column is the observed RGB image. The middle column shows the vertices predicted by RaggeDi; post-processing transforms the mesh into image space. The right column shows the visualized mesh.
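The caption for Figure 5 does not say how post-processing maps the mesh into image space. The sketch below assumes a calibrated pinhole camera and a known rigid transform from the canonical frame to the camera frame, obtained for example from calibration or from the optional point-cloud registration. Both are assumptions. Under them, projecting predicted vertices to pixels looks like this; `project_to_image` and its parameters are hypothetical.

```python
import numpy as np


def project_to_image(points, r_cam, t_cam, fx, fy, cx, cy):
    """Project (N, 3) canonical-space points to (N, 2) pixel coordinates.

    r_cam (3x3) and t_cam (3,) map canonical coordinates into the camera frame
    (assumed known); fx, fy, cx, cy are assumed pinhole intrinsics in pixels.
    """
    cam = points @ r_cam.T + t_cam  # canonical frame -> camera frame
    z = cam[:, 2]
    if np.any(z <= 0):
        raise ValueError("all points must lie in front of the camera")
    u = fx * cam[:, 0] / z + cx  # perspective division, then principal point offset
    v = fy * cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)


# Two vertices placed 1 m in front of a hypothetical 640x480 camera.
pixels = project_to_image(
    np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]),
    np.eye(3),
    np.array([0.0, 0.0, 1.0]),
    fx=600.0, fy=600.0, cx=320.0, cy=240.0,
)
```

Each pixel coordinate can then be drawn over the observed RGB image, as in the middle column of Figure 5. Lens distortion is ignored here.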