Planning and Reasoning with 3D Deformable Objects for Hierarchical Text-to-3D Robotic Shaping

Alison Bartsch, Amir Barati Farimani

TL;DR

The paper tackles autonomous manipulation and shaping of 3D deformable objects (clay) without explicit 3D goals. It proposes a coarse-to-fine pipeline that first places discrete clay segments to form a coarse shape and then refines it with deformation actions, guided by LLM-based sub-goal generation and a region-based PointNet action model. The system integrates additive placement with deformation-based sculpting in a true text-to-3D pipeline and is evaluated through real-world experiments, CLIP analyses, and human surveys, revealing limitations of CLIP as a sole metric for this task. Key contributions include the first autonomous text-to-3D sculpture pipeline, a cluster-based action model that generalizes across shapes, and demonstration of semantically tunable outputs, with future work focused on smoother finishing and robust quantitative evaluation metrics.

Abstract

Deformable object manipulation remains a key challenge in developing autonomous robotic systems that can be successfully deployed in real-world scenarios. In this work, we explore the challenges of deformable object manipulation through the task of sculpting clay into 3D shapes. We propose the first coarse-to-fine autonomous sculpting system in which the sculpting agent first selects how many and where to place discrete chunks of clay into the workspace to create a coarse shape, and then iteratively refines the shape with sequences of deformation actions. We leverage large language models for sub-goal generation, and train a point cloud region-based action model to predict robot actions from the desired point cloud sub-goals. Additionally, our method is the first autonomous sculpting system that is a real-world text-to-3D shaping pipeline without any explicit 3D goals or sub-goals provided to the system. We demonstrate our method is able to successfully create a set of simple shapes solely from text-based prompting. Furthermore, we explore rigorously how to best quantify success for the text-to-3D sculpting task, and compare existing text-image and text-point cloud similarity metrics to human evaluations for this task. For experimental videos, human evaluation details, and full prompts, please see our project website: https://sites.google.com/andrew.cmu.edu/hierarchicalsculpting
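The metric comparison mentioned in the abstract relies on scoring a sculpture against a text prompt by the cosine similarity of their embeddings. The sketch below is only an illustration of that arithmetic, not the authors' evaluation code: the embeddings are plain vectors standing in for the output of a text-image or text-point cloud model, and the names `cosine_similarity` and `best_fit_slope` are hypothetical helpers. Fitting a line of similarity scores against some reference variable (for example human ratings, which is an assumption here) gives a rough sense of whether the metric tracks that variable.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def best_fit_slope(x, y):
    """Least-squares line through (x, y); returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.asarray(x, dtype=float),
                                  np.asarray(y, dtype=float), deg=1)
    return float(slope), float(intercept)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder embeddings: one "text" vector and five "shape" vectors.
    text_emb = rng.normal(size=16)
    shape_embs = rng.normal(size=(5, 16))
    scores = [cosine_similarity(text_emb, s) for s in shape_embs]
    # Hypothetical reference values (e.g. human ratings) for the same shapes.
    reference = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(best_fit_slope(reference, scores))
```

A slope near zero would suggest the similarity score is insensitive to the reference variable, which is the kind of limitation the paper reports when using CLIP as the sole metric.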

Paper Structure

This paper contains 15 sections, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: A visualization of the sculpting sequence for our proposed text-to-3D shaping method. Our pipeline first creates a coarse shape in the scene with discrete chunks of clay, and then iteratively refines the shape with deformation-based actions.
  • Figure 2: The point cloud processing pipeline first captures a dense point cloud of the robot's workspace (a), then isolates the clay point cloud with position and color thresholding (b), next clusters the point cloud into 10 regional geometrical patches (c), and finally uniformly down-samples it so that each cluster contains an equal number of points.
  • Figure 3: a) The full direct action model pipeline with a cluster-based Siamese PointNet embedding network. b) The synthetic pre-training strategy. c) The real-world action finetuning strategy.
  • Figure 4: Scatter plot with line of best fit for the CLIP and PointCLIP-v2 cosine similarity of text and image/point cloud embeddings of 10 human trajectories creating each shape in clay. The line of best fit's slope for each shape and prompt shows how well the CLIP or PointCLIP-v2 score correlates with our human oracle-created shapes and varying prompts.
  • Figure 5: The human oracle is required to follow the same process of coarse-to-fine sculpting using their hands. The choice of camera orientation for each shape was to best visualize the full sculpture (i.e. top-down versus isometric viewpoint).
  • ...and 1 more figure
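
The point cloud processing steps in the Figure 2 caption (threshold, cluster into 10 patches, equalize the cluster sizes) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the caption does not name the clustering algorithm, so plain k-means with farthest-point initialization is an assumed stand-in, random sampling stands in for the uniform down-sampling, and the function names, threshold values, and the 64-points-per-cluster default are hypothetical.

```python
import numpy as np


def isolate_clay(points, colors, xyz_min, xyz_max, rgb_min, rgb_max):
    """Keep points inside a workspace box whose color lies in an RGB range."""
    in_box = np.all((points >= xyz_min) & (points <= xyz_max), axis=1)
    in_color = np.all((colors >= rgb_min) & (colors <= rgb_max), axis=1)
    return points[in_box & in_color]


def cluster_regions(points, k=10, iters=20):
    """Assumed clustering step: k-means with farthest-point initialization."""
    centers = [points[0]]
    dist = np.linalg.norm(points - centers[0], axis=1)
    for _ in range(k - 1):
        centers.append(points[np.argmax(dist)])
        dist = np.minimum(dist, np.linalg.norm(points - centers[-1], axis=1))
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center untouched
                centers[j] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1)


def equalize_clusters(points, labels, k=10, n_per_cluster=64, seed=0):
    """Sample the same number of points from every cluster.

    Samples without replacement when a cluster is large enough,
    and with replacement otherwise.
    """
    rng = np.random.default_rng(seed)
    out = np.empty((k, n_per_cluster, 3))
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            raise ValueError(f"cluster {j} is empty")
        replace = idx.size < n_per_cluster
        out[j] = points[rng.choice(idx, size=n_per_cluster, replace=replace)]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.uniform(-1.0, 1.0, size=(4000, 3))   # synthetic "workspace"
    cols = rng.uniform(0.0, 1.0, size=(4000, 3))   # synthetic RGB in [0, 1]
    clay = isolate_clay(pts, cols, -0.5, 0.5, 0.0, 1.0)
    patches = equalize_clusters(clay, cluster_regions(clay))
    print(patches.shape)
```

The result is a fixed-size array of shape (clusters, points per cluster, 3), which is the kind of regular input a per-region embedding network needs.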