SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking

Nico Leuze, Maximilian Hoh, Samed Doğan, Nicolas R. -Peña, Alfred Schoettl

TL;DR

This work tackles robust 6D pose estimation in cluttered industrial bin-picking by proposing SDT-6D, a fully sparse, depth-only, multi-view framework. It fuses depth maps into either a fine-grained point cloud or a sparse TSDF, then uses a staged RoI and objectness heatmap, plus a density-aware sparse transformer, to focus computation on foreground regions; a novel per-voxel voting head yields simultaneous poses for multiple objects, refined by ICP. Key contributions include a fully sparse 3D representation, a two-stage heatmap mechanism, a dual-branch sparse transformer, and a per-voxel voting strategy that scales to arbitrary object counts while preserving high geometric fidelity. The approach achieves competitive results on IPD and MV-YCB-SymMovCam, demonstrating strong performance in highly cluttered, multi-view bin-picking scenarios and highlighting the practicality of depth-only, scene-adaptive 6D pose estimation with efficient memory use.

Abstract

Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud (in its vanilla version) or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions and thus keeping memory requirements at high resolutions feasible. In addition, we propose a density-aware sparse transformer block that dynamically attends to (self-)occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic applications remains underexplored. Our framework is fully sparse, enabling high-resolution volumetric representations that capture the fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting 6D poses via a novel per-voxel voting strategy that allows simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin-picking scenarios.
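
The per-voxel voting idea can be illustrated with a minimal sketch. It assumes each foreground voxel predicts a translation offset toward its object's center, so that votes from the same object land close together and can be grouped into instances. The function name `cluster_votes`, the `cell` bin size, and the simple grid-binning used for grouping are hypothetical choices for illustration; the abstract does not specify the clustering procedure.

```python
import math
from collections import defaultdict

def cluster_votes(voxel_centers, offsets, cell=0.02):
    """Group voxels by where their center vote lands (hypothetical helper).

    voxel_centers, offsets: sequences of (x, y, z) tuples in meters.
    cell: bin edge length used to group nearby votes (an assumption).
    Returns a list of (voxel_indices, mean_vote), one entry per instance.
    """
    buckets = defaultdict(list)
    for idx, (center, offset) in enumerate(zip(voxel_centers, offsets)):
        # Each voxel votes for an object center: its position plus its offset.
        vote = tuple(c + o for c, o in zip(center, offset))
        # Votes that fall into the same bin are treated as one instance.
        key = tuple(math.floor(v / cell) for v in vote)
        buckets[key].append((idx, vote))

    instances = []
    for members in buckets.values():
        indices = [idx for idx, _ in members]
        # The mean vote serves as the instance's translation estimate.
        mean_vote = tuple(
            sum(vote[axis] for _, vote in members) / len(members)
            for axis in range(3)
        )
        instances.append((indices, mean_vote))
    return instances

# Toy scene: four voxels belonging to two objects.
centers = [(0.10, 0.10, 0.10), (0.11, 0.10, 0.10),
           (0.30, 0.30, 0.30), (0.31, 0.31, 0.30)]
offsets = [(0.005, 0.005, 0.005), (-0.005, 0.005, 0.005),
           (0.005, 0.005, 0.005), (-0.005, -0.005, 0.005)]
instances = cluster_votes(centers, offsets)
```

Because every voxel votes independently, the number of instances is not fixed in advance, which is what lets this style of voting handle an arbitrary number of objects. Grid binning is the simplest possible grouping and can split an instance whose votes straddle a bin boundary; a density-based clustering step would be a more robust substitute.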

Paper Structure

This paper contains 17 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Voxel occupancy statistics on the IPD dataset: (a) Occupied voxels (red) overlaid on the dense workspace grid illustrate the highly irregular and extremely sparse spatial distribution typical of bin-picking scenes ($\vartheta=$ 8 mm). (b) Dense grids incur cubic complexity as resolution increases, whereas sparse voxel occupancy grows only quadratically, enabling a more efficient 3D representation.
  • Figure 2: Overview of our framework architecture. Best viewed in color. [1]: Multiple raw depth observations are fused and discretized into a fine-grained sparse 3D voxel grid. [2]: The grid is discretized more coarsely (2.1) and fed into the RoI Heatmap, which consists of a sparse U-Net. The most likely foreground voxels (visualized in ascending red) (2.2), along with the global context features, are lifted to the original high resolution, while background voxels are dropped via soft assignment (2.3). [3]: The sparsified, yet high-resolution, sparse grid is fed into the Objectness Heatmap. Fully sparse feature extraction layers are applied to obtain sharp objectness votes (3.1) and predict the per-voxel class (3.2). Based on the objectness scoring, we further sparsify the grid via an adaptive $topK$ selector (3.3). [5]: The resulting voxels (depicted in blue in the bottom left image) represent the extremely sparse input to the Sparse 6D Pose Head. Sparse Convolutional Layers are interleaved with Sparse Transformer Blocks [4] to extract fine geometric details and local context. Relative translation offsets (5.1) and rotation estimates (5.2) are predicted at the voxel level. The clusters formed by the offset predictions are used for instance indexing (5.3). Given the canonical object point clouds, we apply a batched ICP for pose refinement (5.4).
  • Figure 3: A single sparse transformer block. Multi-Head Window Self-Attention is performed in two branches with small and medium-sized windows. One attention branch captures fine geometric details, while the coarser one accounts for neighborhood context. An MLP module enables dynamic feature selection.
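
The scaling argument in Figure 1(b) can be reproduced on a toy example. The sketch below voxelizes points sampled from a single flat surface inside a unit cube, a synthetic stand-in for depth observations rather than IPD data, and compares the number of occupied voxels with the size of the dense grid. The helper names `make_plane_points` and `occupancy_counts` are hypothetical.

```python
import math

def make_plane_points(samples_per_axis=200, z=0.5):
    """Sample a flat surface inside the unit cube (toy stand-in for depth data)."""
    step = 1.0 / samples_per_axis
    return [((i + 0.5) * step, (j + 0.5) * step, z)
            for i in range(samples_per_axis)
            for j in range(samples_per_axis)]

def occupancy_counts(points, cells_per_axis):
    """Return (occupied voxel count, dense grid cell count) at one resolution."""
    occupied = set()
    for point in points:
        # Map each coordinate to its voxel index, clamped to the grid bounds.
        index = tuple(
            min(math.floor(v * cells_per_axis), cells_per_axis - 1)
            for v in point
        )
        occupied.add(index)
    return len(occupied), cells_per_axis ** 3

points = make_plane_points()
for cells in (8, 16, 32):
    sparse, dense = occupancy_counts(points, cells)
    print(f"{cells:>3} cells/axis: occupied={sparse:>5}  dense={dense:>6}")
```

For this idealized surface, doubling the resolution multiplies the occupied count by 4 and the dense count by 8, which is the quadratic versus cubic gap the caption describes. Real scenes contain many surfaces at varying orientations, so the exact counts differ, but observed geometry remains surface-like.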