SculptDiff: Learning Robotic Clay Sculpting from Humans with Goal Conditioned Diffusion Policy

Alison Bartsch; Arvind Car; Charlotte Avra; Amir Barati Farimani

TL;DR

This work proposes SculptDiff, a goal-conditioned, diffusion-based imitation learning framework that uses point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of the authors' knowledge, it is the first real-world method to successfully learn manipulation policies for 3D deformable objects.

Abstract

Manipulating deformable objects remains a challenge within robotics due to the difficulties of state estimation, long-horizon planning, and predicting how the object will deform given an interaction. These challenges are most pronounced with 3D deformable objects. We propose SculptDiff, a goal-conditioned, diffusion-based imitation learning framework that works with point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of our knowledge, this is the first real-world method that successfully learns manipulation policies for 3D deformable objects. For sculpting videos and access to our dataset and hardware CAD models, see the project website: https://sites.google.com/andrew.cmu.edu/imitation-sculpting/home

Paper Structure (12 sections, 2 equations, 5 figures, 2 tables)

This paper contains 12 sections, 2 equations, 5 figures, 2 tables.

INTRODUCTION
RELATED WORK
METHOD
Clay Sculpting Task
Point Cloud State Representation
SculptDiff: Point Cloud Diffusion Policy
EXPERIMENTS AND RESULTS
How does the point cloud input influence sculpting performance compared to an image input?
How do the policies trained on multiple shape goals and those trained on individual goals compare in behaviors?
How does the point cloud embedding change when finetuning PointBERT end-to-end with different policies?
How does our system compare to human performance?
CONCLUSION

Figures (5)

Figure 1: We present a goal-conditioned imitation learning framework for sculpting clay that uses a point cloud state representation. We find that our system is much faster at test time than traditional methods that plan with a dynamics model. However, our system is more limited in scope, as imitation learning does not generalize well to unseen goals.
Figure 2: The pipeline of SculptDiff. The state and goal point clouds are each encoded with PointBERT [yu2022point] and a linear projection head to produce latent state and goal observations. These latent observations, together with the previous action executed by the robot, condition the denoising diffusion process of diffusion policy [chi2023], which generates the predicted action sequence.
Figure 3: The experimental setup includes 4 Intel RealSense D415 RGB-D cameras mounted to a camera cage to reconstruct the clay point cloud. An additional Intel RealSense D455 overhead camera is used to record experimental videos. We fit the robot with 3D printed fingertips and an elevated stage similar to those in [bartsch2023]. We assume the clay always remains centered and fixed to the elevated stage throughout the experiments.
Figure 4: The t-SNE embeddings for PointBERT with different training strategies on the X demonstration dataset. The colorbar indicates the state index, ranging from 0 to 7, for the states in each demonstration trajectory for the X shape. a) PointBERT pre-trained on ShapeNet, b) PointBERT finetuned with diffusion policy, c) PointBERT finetuned with ACT policy, and d) PointBERT finetuned with VINN policy.
Figure 5: The final shapes created by the policies trained with point cloud inputs for a single shape goal. For the target point cloud (representing the shapes created by the human oracle using hands) in the left-most column, the lightness of each point is correlated with the point's z-value to visualize depth. While both human oracles create the best shapes, the point cloud diffusion policy creates the closest matches to the human demonstrations.
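
The conditioning step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder below is a placeholder (mean pooling followed by a fixed random linear map) standing in for PointBERT and its projection head, and the names `LATENT_DIM`, `ACTION_DIM`, `encode_point_cloud`, and `build_conditioning` are hypothetical, as are the sizes chosen. It only shows how a latent state, a latent goal, and the previous action might be concatenated into one conditioning vector; the denoising network that consumes this vector is omitted.

```python
import random

LATENT_DIM = 8  # hypothetical embedding width, chosen for illustration
ACTION_DIM = 5  # hypothetical action size; not specified in the captions


def make_projection(in_dim, out_dim, seed):
    """Fixed random linear map standing in for a learned projection head."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def encode_point_cloud(points, projection):
    """Placeholder encoder: mean-pool the xyz coordinates, then project.

    The pipeline in Figure 2 uses PointBERT here; this stub only mimics
    the interface (list of xyz points -> fixed-length latent vector).
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    return [sum(w * c for w, c in zip(row, centroid)) for row in projection]


def build_conditioning(state_cloud, goal_cloud, prev_action, projection):
    """Concatenate latent state, latent goal, and the previous action."""
    z_state = encode_point_cloud(state_cloud, projection)
    z_goal = encode_point_cloud(goal_cloud, projection)
    return z_state + z_goal + list(prev_action)


# Synthetic stand-ins for the clay state and goal point clouds.
rng = random.Random(0)
state = [[rng.random() for _ in range(3)] for _ in range(256)]
goal = [[rng.random() for _ in range(3)] for _ in range(256)]
prev_action = [0.0] * ACTION_DIM

proj = make_projection(3, LATENT_DIM, seed=1)
cond = build_conditioning(state, goal, prev_action, proj)
print(len(cond))  # 2 * LATENT_DIM + ACTION_DIM = 21
```

In this sketch the same projection is shared by the state and goal clouds, which is one plausible reading of the caption; a design with separate encoders per input would change only `build_conditioning`.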
