Diffusion Suction Grasping with Large-Scale Parcel Dataset

Ding-Tao Huang, Xinyi He, Debei Hua, Dongfang Yu, En-Te Lin, Long Zeng

TL;DR

This work tackles robust suction grasping in cluttered parcel scenes with two contributions: the Parcel-Suction-Dataset, a large synthetic benchmark with 25,000 scenes and 410 million labeled suction grasp poses, and Diffusion-Suction, a diffusion-based framework that reframes grasp prediction as a conditional denoising process guided by 3D visual cues. The method decouples a point-cloud encoder from a lightweight diffusion head, the Pointcloud Conditioned Denoising Block (PCDB), enabling efficient inference while learning spatial point-wise affordances from synthetic data. On both Parcel-Suction-Dataset and SuctionNet-1Billion, Diffusion-Suction achieves state-of-the-art performance and strong generalization, with ablations confirming the value of 3D normals, visibility cues, and an appropriate number of diffusion steps. The work targets scalable, reliable parcel handling, and the authors plan to release the code and dataset publicly.

Abstract

While recent advances in object suction grasping have shown remarkable progress, significant challenges persist, particularly in cluttered and complex parcel handling scenarios. Two fundamental limitations hinder current approaches: (1) the lack of a comprehensive suction grasp dataset tailored for parcel manipulation tasks, and (2) insufficient adaptability to diverse object characteristics, including size variations, geometric complexity, and textural diversity. To address these challenges, we present Parcel-Suction-Dataset, a large-scale synthetic dataset containing 25 thousand cluttered scenes with 410 million precision-annotated suction grasp poses. This dataset is generated through our novel geometric sampling algorithm, which enables efficient generation of optimal suction grasps incorporating both physical constraints and material properties. We further propose Diffusion-Suction, an innovative framework that reformulates suction grasp prediction as a conditional generation task through denoising diffusion probabilistic models. Our method iteratively refines random noise into suction grasp score maps through visual-conditioned guidance from point cloud observations, effectively learning spatial point-wise affordances from our synthetic dataset. Extensive experiments demonstrate that the simple yet efficient Diffusion-Suction achieves new state-of-the-art performance compared to previous models on both Parcel-Suction-Dataset and the public SuctionNet-1Billion benchmark.
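The iterative refinement described in the abstract can be illustrated with a minimal sketch of standard DDPM ancestral sampling over per-point scores. This is not the authors' implementation: the linear beta schedule, the function names, and especially `toy_denoiser` are assumptions made for illustration. The toy denoiser stands in for a learned conditional network and "cheats" by deriving the clean scores directly from the conditioning features, so the loop can be run without any training.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.5):
    """Linear beta schedule (an assumption; the paper's schedule may differ).

    beta_end is large so the cumulative signal level is small after few steps.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_target(point_features):
    """Hypothetical clean score per point, used only by the toy denoiser."""
    return 1.0 / (1.0 + np.exp(-point_features.mean(axis=1)))


def toy_denoiser(noisy_scores, point_features, t, alpha_bars):
    """Placeholder for a learned conditional network; returns predicted noise.

    It computes the exact noise from the known toy target. A trained model
    would only approximate this from the point-cloud features.
    """
    clean = toy_target(point_features)
    return (noisy_scores - np.sqrt(alpha_bars[t]) * clean) / np.sqrt(1.0 - alpha_bars[t])


def sample_scores(point_features, num_steps=20, seed=0):
    """Run the DDPM reverse process: Gaussian noise -> per-point scores."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    scores = rng.standard_normal(point_features.shape[0])  # start from noise
    for t in reversed(range(num_steps)):
        eps = toy_denoiser(scores, point_features, t, alpha_bars)
        # Posterior mean of the previous step given the predicted noise.
        mean = (scores - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(scores.shape) if t > 0 else 0.0
        scores = mean + np.sqrt(betas[t]) * noise
    return scores
```

Calling `sample_scores(features)` with an `(N, C)` feature array returns `N` scores. Because the toy denoiser is exact, the result matches `toy_target(features)`; with a trained network the output would instead be the model's estimate of the score map.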

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: This work addresses two major challenges in the suction grasping task. We propose a novel pipeline that evaluates suction grasping from different perspectives to obtain annotation labels. We propose a novel framework to predict suction grasping poses by reformulating the task as an iterative diffusion-denoising process.
  • Figure 2: Overview of the Self-Parcel-Suction-Labeling pipeline. First, image prompts are fed to a 3D asset generation model to produce high-quality 3D parcel assets with geometry and appearance information. Next, random unstructured parcel scenes are created with the Bullet and Blender simulation platforms. Finally, candidate suction grasps are evaluated from four different perspectives to obtain accurate annotation labels.
  • Figure 3: Overview of the Diffusion-Suction architecture. Diffusion-Suction learns an iterative denoising process that turns random noise into a suction grasping score map under the guidance of the input point cloud. It uses PointNet++ to extract features as conditional guidance and includes a novel, lightweight Pointcloud Conditioned Denoising Block. During the filtering stage, Non-Maximal Suppression is applied as a post-processing step. The red spheres represent the predicted scores ($S > 0.8$). The reverse process gradually shifts from random noise to suction grasping scores.
  • Figure 4: The denoising reverse process with 20 inference steps. A blue cylinder represents a suction pose, and the color intensity indicates the confidence of that pose. The reverse process gradually shifts from pure noise to refined suction scores.
  • Figure 5: Qualitative results on SuctionNet-1Billion and Parcel-Suction-Dataset. Good grasp poses are marked in blue; unsuitable poses are shown in black.
  • ...and 1 more figure
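The filtering stage mentioned in the Figure 3 caption (score thresholding followed by Non-Maximal Suppression) could look roughly like the sketch below. The captions do not describe how the suppression is performed, so the greedy, distance-based rule, the `radius` parameter, and the function name `point_nms` are assumptions for illustration. Only the 0.8 threshold is taken from the caption, where it is used for visualization.

```python
import numpy as np


def point_nms(points, scores, score_threshold=0.8, radius=0.02):
    """Greedy distance-based NMS over scored 3D points (illustrative only).

    points: (N, 3) array of candidate suction positions.
    scores: (N,) array of predicted scores.
    Returns indices of kept points, highest score first.
    """
    # Keep only candidates above the score threshold.
    candidates = np.flatnonzero(scores > score_threshold)
    # Visit candidates from highest to lowest score.
    order = candidates[np.argsort(-scores[candidates], kind="stable")]
    kept = []
    for i in order:
        # Suppress a candidate if it lies within `radius` of a kept point.
        if all(np.linalg.norm(points[i] - points[j]) >= radius for j in kept):
            kept.append(int(i))
    return kept
```

For example, given four points on a line at x = 0, 0.01, 0.1, and 0.2 with scores 0.95, 0.9, 0.85, and 0.5, the function keeps indices 0 and 2: the second point is suppressed by its higher-scoring neighbor, and the fourth falls below the threshold.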