Manipulating Elasto-Plastic Objects With 3D Occupancy and Learning-Based Predictive Control

Zhen Zhang, Xiangyu Chu, Yunxi Tang, Lulu Zhao, Jing Huang, Zhongliang Jiang, K. W. Samuel Au

TL;DR

This work addresses the challenge of manipulating elasto-plastic volumetric objects under quasi-static motion by proposing a dense 3D occupancy representation and a learning-based framework that combines a 3D CNN-GNN dynamics model with model predictive control. It introduces a data collection platform and a pipeline to generate dense 3D occupancy ground truth from multi-view RGB-D data, training an occupancy prediction network that operates during manipulation. The proposed approach demonstrates accurate state representation, improved dynamics prediction, and successful shaping of plasticine into target geometries in both simulation and real-world experiments, outperforming voxel-based baselines. By enabling dense internal-state reasoning and occlusion-robust planning, the framework advances practical manipulation of soft, irreversible materials with potential impact on everyday robotics and deformable-object handling tasks.

Abstract

Manipulating elasto-plastic objects remains a significant challenge due to severe self-occlusion, difficulty of representation, and complicated dynamics. This work proposes a novel framework for elasto-plastic object manipulation under a quasi-static motion assumption. It addresses these challenges with three components: 3D occupancy as the object representation, a dynamics model learned from 3D occupancy, and a learning-based predictive control algorithm. We build a novel data collection platform that captures full spatial information and propose a pipeline for generating a 3D occupancy dataset. To infer 3D occupancy during manipulation, an occupancy prediction network takes multiple RGB images as input and is trained under supervision from the generated dataset. We design a deep neural network that combines a 3D convolutional neural network (CNN) and a graph neural network (GNN) to predict complex deformation from the inferred 3D occupancy. The learning-based predictive control algorithm plans the robot actions and incorporates a novel shape-based action initialization module designed to improve planner efficiency. The framework successfully shapes elasto-plastic objects into a given goal shape, as verified in experiments both in simulation and in the real world.

Paper Structure

This paper contains 31 sections, 3 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Overview of our framework and results. We predict 3D occupancy from RGB images, learn the dynamics with a deep neural network, and apply learning-based predictive control with shape-based action initialization to deform the object into a goal shape.
  • Figure 2: Inferred 3D occupancy of a manipulation scenario in the simulator. (a) RGB observation. (b) Visualization of the inferred 3D occupancy. The grey, yellow, and dark blue parts represent the gripper, plasticine, and operating plane, respectively.
  • Figure 3: 3D occupancy prediction framework. We use four cameras to extract multi-scale features, fuse them with 2D-3D spatial attention, and predict 3D occupancy, supervised by the ground truth.
  • Figure 4: Overall structure of our proposed 3D CNN-based dynamics model. (a) The 3D occupancy of the current step $t$ and the previous three steps $t-3, \dots, t-1$ is down-sampled to construct a voxel-based state graph, and node features $f_{j}$ are extracted through a state encoder. (b) The 3D occupancy of the current step $t$ is fed into a 3D sparse CNN to learn multi-scale semantic and spatial features. The learned voxel-wise features of each graph node are then retrieved and summarized into a feature set from multiple levels through a voxel set abstraction module. (c) During the training phase, the aggregated node features are concatenated to the encoded node features $f_{j}$ and then fed into the GNN for dynamics model training. (d) During evaluation, voxel-wise features are masked and only used at the initial step ($t=0$).
  • Figure 5: Pipeline of our shape-based action initialization for the gripper. The blue point cloud $s_{p}$ is the initial state of the plasticine and the yellow point cloud $s_{gt}$ is the goal state (i.e., the letter "A"). (a) Align $s_{p}$ to $s_{gt}$. (b) Filter out points close to $s_{gt}$. (c) Segment the point clouds into $N\times2$ regions along the $x$-axis and $y$-axis as two branches. (d) Calculate the cost of moving the points in each region using Euclidean distance. (e) Select the part with the maximum cost from each of the two branches, and calculate the direction and center of the line connecting the two fingers as the reference initialization. (f) Visualization of the initialized actions.
  • ...and 8 more figures
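
The steps in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: alignment is approximated by centroid translation, the $N\times2$ grid is read as $N$ bins along $x$ and two halves along $y$ (one half per branch), the cost of a region is the summed nearest-neighbor Euclidean distance to the goal, and the names `init_grasp`, `n_bins`, and `close_thresh` are hypothetical.

```python
import math


def centroid(points):
    """Mean of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def nearest_distance(p, cloud):
    """Euclidean distance from p to its nearest neighbor in cloud."""
    return min(math.dist(p, q) for q in cloud)


def init_grasp(s_p, s_gt, n_bins=4, close_thresh=0.01):
    """Return (center, direction) of the line connecting two fingers, or None.

    All design choices here are assumptions made for illustration only.
    """
    # (a) Align s_p to s_gt (assumed: translate centroid onto centroid).
    cp, cg = centroid(s_p), centroid(s_gt)
    aligned = [tuple(p[i] - cp[i] + cg[i] for i in range(3)) for p in s_p]

    # (b) Filter out points that are already close to the goal.
    scored = [(p, nearest_distance(p, s_gt)) for p in aligned]
    far = [(p, d) for p, d in scored if d > close_thresh]
    if not far:
        return None

    # (c) Segment into N x 2 regions: N bins along x, two halves along y.
    xs = [p[0] for p, _ in far]
    x_min, x_max = min(xs), max(xs)
    width = (x_max - x_min) / n_bins or 1.0  # guard against zero extent
    regions = {}
    for p, d in far:
        col = min(int((p[0] - x_min) / width), n_bins - 1)
        row = 0 if p[1] < cg[1] else 1
        regions.setdefault((row, col), []).append((p, d))

    # (d) + (e) Per branch, pick the region with the maximum summed distance.
    fingers = []
    for row in (0, 1):
        candidates = [pts for (r, _), pts in regions.items() if r == row]
        if not candidates:
            return None
        best = max(candidates, key=lambda pts: sum(d for _, d in pts))
        fingers.append(centroid([p for p, _ in best]))

    # Center and unit direction of the line connecting the two fingers.
    a, b = fingers
    center = tuple((a[i] + b[i]) / 2 for i in range(3))
    length = math.dist(a, b)
    direction = tuple((b[i] - a[i]) / length for i in range(3))
    return center, direction
```

For example, with a thin bar as the goal and a wider slab as the initial state, the sketch places the two fingers on opposite sides of the bar, so the returned direction points across it. When the initial state already matches the goal, every point is filtered out in step (b) and the function returns `None`.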