Precise Pick-and-Place using Score-Based Diffusion Networks

Shih-Wei Guo; Tsu-Ching Hsiao; Yu-Lun Liu; Chun-Yi Lee

TL;DR

The paper addresses precise pick-and-place under limited data by introducing a two-stage coarse-to-fine diffusion framework operating on $SE(2)^N$ and conditioned on top-down RGB projections. It extends score-based diffusion models to $SE(2)$ and uses ORoI-based refinement to achieve high translational and rotational accuracy, outperforming Transporter-based baselines in both simulation and real-robot experiments. The approach demonstrates data efficiency, requiring as few as one demonstration, and relies solely on RGB inputs, with augmentation strategies to close the gap between training and deployment. This yields a practical, scalable solution for high-precision robotic manipulation and opens avenues for depth-informed or non-top-down extensions.

Abstract

In this paper, we propose a novel coarse-to-fine continuous pose diffusion method to enhance the precision of pick-and-place operations within robotic manipulation tasks. Leveraging the capabilities of diffusion networks, we facilitate the accurate perception of object poses. This accurate perception enhances both pick-and-place success rates and overall manipulation precision. Our methodology utilizes a top-down RGB image projected from an RGB-D camera and adopts a coarse-to-fine architecture. This architecture enables efficient learning of coarse and fine models. A distinguishing feature of our approach is its focus on continuous pose estimation, which enables more precise object manipulation, particularly concerning rotational angles. In addition, we employ pose and color augmentation techniques to enable effective training with limited data. Through extensive experiments in simulated and real-world scenarios, as well as an ablation study, we comprehensively evaluate our proposed methodology. Taken together, the findings validate its effectiveness in achieving high-precision pick-and-place tasks.
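The idea of refining a continuous pose by iterative denoising can be illustrated with a minimal sketch. It is not the authors' model: the learned, image-conditioned score network is replaced here by a hand-written analytic score that points toward a known target pose, and the noise schedule, step size, and function names (`toy_score`, `denoise_pose`) are assumptions chosen for illustration. The sampler is a plain annealed Langevin update on a single $SE(2)$ pose $(x, y, \theta)$, with the angle wrapped after each step.

```python
import numpy as np

def wrap_angle(a):
    """Map an angle to the interval [-pi, pi)."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def toy_score(pose, target, sigma):
    """Analytic stand-in for a learned score (assumption, not the paper's network).

    Returns the score of a Gaussian with standard deviation sigma centred on
    the target pose, using the wrapped difference for the rotation component.
    """
    diff = pose - target
    diff[2] = wrap_angle(diff[2])
    return -diff / sigma**2

def denoise_pose(target, sigmas, steps_per_level=20, eps=0.1, seed=0):
    """Annealed Langevin sampling over noise levels, from largest to smallest."""
    rng = np.random.default_rng(seed)
    # Start from a random pose: translation in [-1, 1]^2, rotation in [-pi, pi).
    pose = np.array([rng.uniform(-1, 1), rng.uniform(-1, 1),
                     rng.uniform(-np.pi, np.pi)])
    for sigma in sigmas:
        step = eps * sigma**2  # step size shrinks with the noise level
        for _ in range(steps_per_level):
            noise = rng.standard_normal(3)
            pose = pose + 0.5 * step * toy_score(pose, target, sigma) \
                + np.sqrt(step) * noise
            pose[2] = wrap_angle(pose[2])  # keep the angle on the circle
    return pose

target = np.array([0.3, -0.2, np.pi / 4])      # hypothetical ground-truth pose
sigmas = np.geomspace(1.0, 0.01, num=10)       # assumed noise schedule
estimate = denoise_pose(target, sigmas)
error = estimate - target
error[2] = wrap_angle(error[2])
print("estimate:", estimate)
print("absolute error:", np.abs(error))
```

In the actual method the score would come from a network conditioned on the top-down RGB image, and a second, fine-stage model would repeat the procedure on a cropped region around the coarse estimate; neither is modeled here.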

Paper Structure (27 sections, 11 equations, 7 figures, 3 tables)

Introduction
Related Work
Pick-and-Place
Transporter Network and Its Successor
Diffusion Models and Its Application in Manipulation
Background
Score-Based Generative Models
Score-Based Pose Diffusion Models
Methodology
Problem Statement
Framework Overview
Extending Score-Based Pose Diffusion Models
Architecture Design
Data Augmentation
Experimental Results
...and 12 more sections

Figures (7)

Figure 1: The proposed two-stage pose diffusion framework. (a) The coarse-to-fine stages for estimating the pick and place poses. (b) The pose diffusion models in (a) comprise a conditioning part and a denoising part.
Figure 2: An illustration of the iterative refinement of pose estimates through denoising steps. White arrows represent the ground truth, while multi-colored arrows, transitioning from $i=L$ to $i=1$, signify the evolving pose estimate at each step.
Figure 3: An illustration of the pose augmentation process that alters the pick and place poses for training the coarse model.
Figure 4: An illustration of the real robotic hardware setup.
Figure 5: Simulation and real-world tasks. The top row depicts the initial states, while the bottom row shows the final states after task completion. The real-world tasks were executed using our methodology on a robotic arm.
...and 2 more figures
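The pose augmentation mentioned in the Figure 3 caption is not detailed in this summary. As a hedged illustration only, the sketch below shows one generic form of $SE(2)$ pose augmentation: a single random rigid transform is applied to both the pick pose and the place pose, which changes their absolute values while preserving the relative transform between them. The function names, the shift range, and the choice of a shared transform are assumptions, not the authors' procedure; the matching warp of the RGB image is omitted.

```python
import numpy as np

def se2_matrix(x, y, theta):
    """Homogeneous 3x3 matrix for the SE(2) pose (x, y, theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

def se2_from_matrix(T):
    """Recover (x, y, theta) from a homogeneous SE(2) matrix."""
    return np.array([T[0, 2], T[1, 2], np.arctan2(T[1, 0], T[0, 0])])

def augment_poses(pick, place, rng, max_shift=0.1):
    """Apply one random rigid transform to both poses (hypothetical helper)."""
    g = se2_matrix(rng.uniform(-max_shift, max_shift),
                   rng.uniform(-max_shift, max_shift),
                   rng.uniform(-np.pi, np.pi))
    new_pick = se2_from_matrix(g @ se2_matrix(*pick))
    new_place = se2_from_matrix(g @ se2_matrix(*place))
    return new_pick, new_place

rng = np.random.default_rng(0)
pick = np.array([0.10, 0.20, 0.3])       # hypothetical pick pose
place = np.array([-0.25, 0.05, -1.2])    # hypothetical place pose
new_pick, new_place = augment_poses(pick, place, rng)

# The pick-to-place relative transform is unchanged by the shared transform.
rel_before = np.linalg.inv(se2_matrix(*pick)) @ se2_matrix(*place)
rel_after = np.linalg.inv(se2_matrix(*new_pick)) @ se2_matrix(*new_place)
print("relative transform preserved:", np.allclose(rel_before, rel_after))
```

For training data, the same transform would also have to be applied to the input image so that the augmented poses still line up with the observed objects.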
