Planning and Reasoning with 3D Deformable Objects for Hierarchical Text-to-3D Robotic Shaping
Alison Bartsch, Amir Barati Farimani
TL;DR
The paper tackles autonomous manipulation and shaping of 3D deformable objects (clay) without explicit 3D goals. It proposes a coarse-to-fine pipeline that first places discrete clay segments to form a coarse shape and then refines that shape with deformation actions, guided by LLM-based sub-goal generation and a region-based PointNet action model. The system combines additive placement with deformation-based sculpting in a single text-to-3D pipeline. It is evaluated through real-world experiments, CLIP analyses, and human surveys, which together reveal the limitations of CLIP as a sole metric for this task. Key contributions include the first autonomous text-to-3D sculpting pipeline, a cluster-based action model that generalizes across shapes, and a demonstration of semantically tunable outputs. Future work focuses on smoother finishing and more robust quantitative evaluation metrics.
Abstract
Deformable object manipulation remains a key challenge in developing autonomous robotic systems that can be successfully deployed in real-world scenarios. In this work, we explore the challenges of deformable object manipulation through the task of sculpting clay into 3D shapes. We propose the first coarse-to-fine autonomous sculpting system, in which the sculpting agent first decides how many discrete chunks of clay to place in the workspace and where to place them to create a coarse shape, and then iteratively refines the shape with sequences of deformation actions. We leverage large language models for sub-goal generation, and train a point cloud region-based action model to predict robot actions from the desired point cloud sub-goals. Additionally, our method is the first autonomous sculpting system that is a real-world text-to-3D shaping pipeline without any explicit 3D goals or sub-goals provided to the system. We demonstrate that our method can create a set of simple shapes solely from text-based prompting. Furthermore, we rigorously explore how best to quantify success for the text-to-3D sculpting task, and compare existing text-image and text-point cloud similarity metrics to human evaluations for this task. For experimental videos, human evaluation details, and full prompts, please see our project website: https://sites.google.com/andrew.cmu.edu/hierarchicalsculpting
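To make the text-image similarity metrics mentioned above more concrete, the following is a minimal sketch of how a CLIP-style score is typically computed: as the cosine similarity between a text embedding and an image embedding. It is not the authors' evaluation code. The function names `cosine_similarity` and `mean_view_similarity` are hypothetical, the embeddings are assumed to have been computed elsewhere by an embedding model, and averaging the score over several camera views of the sculpture is an assumption made here for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_view_similarity(
    text_embedding: Sequence[float],
    view_embeddings: Sequence[Sequence[float]],
) -> float:
    """Average text-image similarity over several views of one sculpture.

    Multi-view averaging is an assumption of this sketch, not a detail
    taken from the paper.
    """
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    scores = [cosine_similarity(text_embedding, v) for v in view_embeddings]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2D vectors standing in for real embeddings.
    text = [1.0, 0.0]
    views = [[1.0, 0.0], [0.0, 1.0]]
    print(mean_view_similarity(text, views))
```

A score of this kind measures only embedding alignment, which is one reason to compare it against human ratings rather than rely on it alone.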
