Pickalo: Leveraging 6D Pose Estimation for Low-Cost Industrial Bin Picking

Alessandro Tarsi, Matteo Mastrogiuseppe, Saverio Taliani, Simone Cortinovis, Ugo Pattacini

Abstract

Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask R-CNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, grasp planning queries a utility-based ranking and a fast collision checker. Deployed on a UR5e with a parallel-jaw gripper and an Intel RealSense D435i, Pickalo achieves up to 600 mean picks per hour with 96-99% grasp success and robust performance over 30-minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long-term stability and throughput in realistic industrial conditions. Videos are available at https://mesh-iit.github.io/project-jl2-camozzi/
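The pose buffer described above fuses repeated 6D pose observations of the same object across viewpoints to suppress per-view noise. A minimal sketch of such temporal fusion is shown below, assuming poses are given as a translation plus a unit quaternion (w, x, y, z); the `PoseBuffer` class, its buffer length, and the sign-aligned quaternion averaging are illustrative assumptions, and the paper's symmetry handling is omitted.

```python
from collections import deque
import math

class PoseBuffer:
    """Hypothetical sketch of multi-view pose fusion.

    Translations are averaged component-wise; rotations (unit
    quaternions, w-x-y-z) are sign-aligned to the first observation,
    summed, and re-normalised, which approximates the chordal mean
    for nearby rotations. Object-symmetry handling is not modelled.
    """

    def __init__(self, maxlen=10):
        # Keep only the most recent observations.
        self.obs = deque(maxlen=maxlen)

    def add(self, t, q):
        """Store one observation: translation t (len 3), quaternion q (len 4)."""
        self.obs.append((t, q))

    def fused(self):
        """Return the fused (translation, unit quaternion) estimate."""
        n = len(self.obs)
        # Component-wise mean of translations.
        t_mean = [sum(t[i] for t, _ in self.obs) / n for i in range(3)]
        # Align quaternion signs (q and -q encode the same rotation).
        ref = self.obs[0][1]
        q_sum = [0.0] * 4
        for _, q in self.obs:
            s = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0 else -1.0
            for i in range(4):
                q_sum[i] += s * q[i]
        norm = math.sqrt(sum(c * c for c in q_sum))
        return t_mean, [c / norm for c in q_sum]
```

Averaging over a buffer like this trades a small amount of latency for substantially lower pose jitter, which matters when the pose feeds a collision checker.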

Paper Structure

This paper contains 20 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The experimental setup consists of a UR5e manipulator with a consumer-grade camera attached to the wrist. The bin is a standard eurobox heavily filled with small metallic objects.
  • Figure 2: Overview of the presented pipeline. A stereo-pair image is acquired and processed by the depth estimation block to obtain an enhanced depth reconstruction. The resulting depth is aligned to the left RGB frame and provided to the 6D Pose Estimation model, together with the object model CAD. The scene state is reconstructed by combining pose estimates across multiple views, occupied voxels, and static objects. An example of this representation is shown in the top-right corner. Given the scene state, the grasp planning ranks and tests pre-computed grasp annotations to find a collision-free grasping trajectory.
  • Figure 3: Depth reconstruction from the RealSense (left) and from BridgeDepth (right).
  • Figure 4: Example of various grasping poses with the corresponding utility scores $S(g)$ given in percentage. The grasping poses are sorted in descending order.
  • Figure 5: The sorted grasp poses are checked following the ranking order until a feasible grasp pose is found. Grasp poses leading to a collision are discarded.
  • ...and 4 more figures
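Figures 4 and 5 describe online grasp selection: precomputed candidates are ranked by a utility score $S(g)$ in descending order and checked for collisions in that order until a feasible grasp is found. A minimal sketch of that loop follows; `plan_grasp`, `utility`, and `in_collision` are hypothetical names standing in for the paper's scoring function and voxel-based collision checker.

```python
def plan_grasp(candidates, utility, in_collision):
    """Return the highest-utility collision-free grasp, or None.

    candidates   -- iterable of precomputed grasp annotations
    utility      -- callable g -> score S(g) (higher is better)
    in_collision -- callable g -> True if the grasp collides with the scene
    """
    # Rank candidates by utility, best first (Fig. 4).
    for g in sorted(candidates, key=utility, reverse=True):
        # Test in ranking order; discard colliding grasps (Fig. 5).
        if not in_collision(g):
            return g
    return None  # no feasible grasp; e.g. acquire a new viewpoint
```

Checking candidates lazily in ranked order means the expensive collision test usually runs only a handful of times before a feasible grasp is found.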