
Precise Pick-and-Place using Score-Based Diffusion Networks

Shih-Wei Guo, Tsu-Ching Hsiao, Yu-Lun Liu, Chun-Yi Lee

TL;DR

The paper addresses precise pick-and-place under limited data by introducing a two-stage coarse-to-fine diffusion framework operating on $SE(2)^N$ and conditioned on top-down RGB projections. It extends score-based diffusion models to $SE(2)$ and uses ORoI-based refinement to achieve high translational and rotational accuracy, outperforming Transporter-based baselines in both simulation and real-robot experiments. The approach demonstrates data efficiency, requiring as few as one demonstration, and relies solely on RGB inputs, with augmentation strategies to close the gap between training and deployment. This yields a practical, scalable solution for high-precision robotic manipulation and opens avenues for depth-informed or non-top-down extensions.

Abstract

In this paper, we propose a novel coarse-to-fine continuous pose diffusion method to enhance the precision of pick-and-place operations within robotic manipulation tasks. Leveraging the capabilities of diffusion networks, we facilitate the accurate perception of object poses. This accurate perception enhances both pick-and-place success rates and overall manipulation precision. Our methodology utilizes a top-down RGB image projected from an RGB-D camera and adopts a coarse-to-fine architecture. This architecture enables efficient learning of coarse and fine models. A distinguishing feature of our approach is its focus on continuous pose estimation, which enables more precise object manipulation, particularly concerning rotational angles. In addition, we employ pose and color augmentation techniques to enable effective training with limited data. Through extensive experiments in simulated and real-world scenarios, as well as an ablation study, we comprehensively evaluate our proposed methodology. Taken together, the findings validate its effectiveness in achieving high-precision pick-and-place tasks.
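The pose augmentation mentioned above can be illustrated with a small sketch. It assumes planar poses written as `(x, y, theta)` and applies one random rigid transform to both the pick and the place pose. The helper names (`apply_se2`, `augment_poses`), the sampling ranges, and the choice to share a single transform between the two poses are illustrative assumptions, not the authors' implementation. In practice the same transform would also have to be applied to the top-down image so that labels and pixels stay aligned.

```python
import math
import random


def wrap_angle(theta):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def apply_se2(transform, pose):
    """Apply a rigid transform (tx, ty, alpha) to a planar pose (x, y, theta)."""
    tx, ty, alpha = transform
    x, y, theta = pose
    c, s = math.cos(alpha), math.sin(alpha)
    # Rotate the position about the origin, then translate; the heading
    # simply shifts by the rotation angle.
    return (c * x - s * y + tx, s * x + c * y + ty, wrap_angle(theta + alpha))


def augment_poses(pick, place, max_shift, rng):
    """Draw one random rigid transform and apply it to both poses.

    Hypothetical helper: the sampling ranges are placeholders.
    """
    transform = (
        rng.uniform(-max_shift, max_shift),
        rng.uniform(-max_shift, max_shift),
        rng.uniform(-math.pi, math.pi),
    )
    return apply_se2(transform, pick), apply_se2(transform, place), transform


if __name__ == "__main__":
    rng = random.Random(0)
    new_pick, new_place, used = augment_poses((0.1, 0.2, 0.0), (0.4, -0.1, 1.0), 0.05, rng)
    print(new_pick, new_place, used)
```

Because a rigid transform preserves distances and relative angles, the pick-to-place relationship is unchanged under this sketch; only the absolute placement in the workspace varies.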

Paper Structure (27 sections, 11 equations, 7 figures, 3 tables)

This paper contains 27 sections, 11 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: The proposed two-stage pose diffusion framework. (a) The coarse-to-fine stages for estimating the pick and place poses. (b) The pose diffusion models in (a) comprise a conditioning part and a denoising part.
  • Figure 2: An illustration of the iterative refinement of pose estimates through denoising steps. White arrows represent the ground truth, while multi-colored arrows, transitioning from $i=L$ to $i=1$, signify the evolving pose estimate at each step.
  • Figure 3: An illustration of the pose augmentation process that alters the pick and place poses for training the coarse model.
  • Figure 4: An illustration of the real robotic hardware setup.
  • Figure 5: Simulation and real-world tasks. The top row depicts the initial states, while the bottom row shows the final states after task completion. The real-world tasks were executed using our methodology on a robotic arm.
  • ...and 2 more figures
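
The iterative refinement described for Figure 2 can be sketched as an annealed Langevin-style loop over a planar pose `(x, y, theta)`, stepping through noise levels from the largest (step $i=L$) to the smallest (step $i=1$). This is a toy sketch under stated assumptions: `toy_score` is an analytic stand-in for the learned, image-conditioned score network, and the noise schedule, step size, and function names are illustrative choices rather than values from the paper.

```python
import math
import random


def wrap_angle(theta):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def toy_score(pose, target, sigma):
    """Score of a Gaussian centred on `target` with standard deviation `sigma`.

    Assumption: this replaces the learned score network. The angular
    difference is wrapped so the score follows the shortest rotation.
    """
    x, y, theta = pose
    tx, ty, ttheta = target
    inv_var = 1.0 / (sigma * sigma)
    return (
        (tx - x) * inv_var,
        (ty - y) * inv_var,
        wrap_angle(ttheta - theta) * inv_var,
    )


def refine_pose(pose, target, sigmas, steps_per_level=20, eps=0.1, rng=None):
    """Refine a pose with Langevin updates at decreasing noise levels."""
    rng = rng or random.Random(0)
    x, y, theta = pose
    for sigma in sigmas:  # ordered from coarse (large sigma) to fine (small sigma)
        step = eps * sigma * sigma
        noise_scale = math.sqrt(2.0 * step)
        for _ in range(steps_per_level):
            sx, sy, stheta = toy_score((x, y, theta), target, sigma)
            # Langevin update: move along the score, then inject scaled noise.
            x += step * sx + noise_scale * rng.gauss(0.0, 1.0)
            y += step * sy + noise_scale * rng.gauss(0.0, 1.0)
            theta = wrap_angle(theta + step * stheta + noise_scale * rng.gauss(0.0, 1.0))
    return x, y, theta


if __name__ == "__main__":
    schedule = [1.0, 0.3, 0.1, 0.03, 0.01]
    print(refine_pose((2.0, -1.0, 3.0), (0.5, 0.25, -1.0), schedule))
```

Because the injected noise shrinks with each level, the estimate moves quickly at first and then settles near the target, which mirrors the coarse-to-fine behaviour of the arrows in the figure. The rotation is kept as a continuous angle throughout rather than being snapped to discrete bins.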