Precise Pick-and-Place using Score-Based Diffusion Networks
Shih-Wei Guo, Tsu-Ching Hsiao, Yu-Lun Liu, Chun-Yi Lee
TL;DR
The paper addresses precise pick-and-place under limited data by introducing a two-stage coarse-to-fine diffusion framework operating on $SE(2)^N$ and conditioned on top-down RGB projections. It extends score-based diffusion models to $SE(2)$ and uses ORoI-based refinement to achieve high translational and rotational accuracy, outperforming Transporter-based baselines in both simulation and real-robot experiments. The approach demonstrates data efficiency, requiring as few as one demonstration, and relies solely on RGB inputs, with augmentation strategies to close the gap between training and deployment. This yields a practical, scalable solution for high-precision robotic manipulation and opens avenues for depth-informed or non-top-down extensions.
Abstract
We propose a coarse-to-fine continuous pose diffusion method that improves the precision of pick-and-place operations in robotic manipulation. Diffusion networks provide accurate perception of object poses, which in turn raises both pick-and-place success rates and overall manipulation precision. The method takes as input a top-down RGB image projected from an RGB-D camera and adopts a coarse-to-fine architecture that enables efficient learning of the coarse and fine models. A distinguishing feature is its focus on continuous pose estimation, which enables more precise object manipulation, particularly with respect to rotation angles. We also employ pose and color augmentation to enable effective training with limited data. We evaluate the method through extensive experiments in simulated and real-world scenarios, together with an ablation study; the findings validate its effectiveness on high-precision pick-and-place tasks.
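
To make the idea of working with continuous planar poses more concrete, the following is a minimal sketch of how a pose in $SE(2)$ might be perturbed with Gaussian noise drawn in the tangent space, as a forward (noising) step of a score-based model typically requires. It is an illustration under stated assumptions, not the implementation used in this work: the function names (`se2_exp`, `pose_to_matrix`, `matrix_to_pose`, `perturb_pose`), the right-multiplication convention, and the noise scales are all hypothetical choices.

```python
import numpy as np


def se2_exp(twist):
    """Exponential map: twist (vx, vy, omega) -> 3x3 homogeneous SE(2) matrix."""
    vx, vy, omega = twist
    c, s = np.cos(omega), np.sin(omega)
    if abs(omega) < 1e-8:
        # Limits of sin(w)/w and (1 - cos(w))/w as w -> 0.
        a, b = 1.0, 0.0
    else:
        a, b = s / omega, (1.0 - c) / omega
    T = np.eye(3)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 2] = [a * vx - b * vy, b * vx + a * vy]
    return T


def pose_to_matrix(x, y, theta):
    """Planar pose (x, y, theta) -> 3x3 homogeneous matrix."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(3)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 2] = [x, y]
    return T


def matrix_to_pose(T):
    """3x3 homogeneous matrix -> (x, y, theta), with theta in (-pi, pi]."""
    return T[0, 2], T[1, 2], np.arctan2(T[1, 0], T[0, 0])


def perturb_pose(pose, sigma, rng):
    """Apply tangent-space Gaussian noise to a pose.

    Assumption: noise is composed on the right, i.e. expressed in the
    pose's own frame. `sigma` holds per-component standard deviations
    (vx, vy, omega); the values used below are arbitrary.
    """
    noise = rng.normal(scale=sigma, size=3)
    return matrix_to_pose(pose_to_matrix(*pose) @ se2_exp(noise))


rng = np.random.default_rng(0)
clean = (0.10, -0.05, np.pi / 6)              # hypothetical pose: metres, radians
noisy = perturb_pose(clean, (0.01, 0.01, 0.05), rng)
```

Sampling the noise in the tangent space and mapping it back through the exponential keeps the perturbed pose a valid rigid transform and avoids treating the angle as an unbounded Euclidean coordinate. For $SE(2)^N$, the same perturbation would be applied independently to each of the $N$ poses.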
