SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking
Nico Leuze, Maximilian Hoh, Samed Doğan, Nicolas R.-Peña, Alfred Schoettl
TL;DR
This work tackles robust 6D pose estimation in cluttered industrial bin-picking by proposing SDT-6D, a fully sparse, depth-only, multi-view framework. It fuses depth maps into either a fine-grained point cloud or a sparse TSDF, then uses a staged RoI and objectness heatmap, plus a density-aware sparse transformer, to focus computation on foreground regions; a novel per-voxel voting head yields simultaneous poses for multiple objects, refined by ICP. Key contributions include a fully sparse 3D representation, a two-stage heatmap mechanism, a dual-branch sparse transformer, and a per-voxel voting strategy that scales to arbitrary object counts while preserving high geometric fidelity. The approach achieves competitive results on IPD and MV-YCB-SymMovCam, demonstrating strong performance in highly cluttered, multi-view bin-picking scenarios and highlighting the practicality of depth-only, scene-adaptive 6D pose estimation with efficient memory use.
Abstract
Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud (in its vanilla version) or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions and thus keeping memory requirements feasible at high resolutions. In addition, we propose a density-aware sparse transformer block that dynamically attends to (self-)occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations that capture the fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting 6D poses via a novel per-voxel voting strategy that allows simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin-picking scenarios.
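To make the fused, sparse scene representation concrete, the following is a minimal sketch of how multi-view depth maps could be lifted into one point cloud and then stored as occupied voxels only. It is not the authors' implementation: the function names `backproject` and `sparse_voxelize`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the 3x4 camera-to-world pose convention, and the toy values are all assumptions made for illustration.

```python
from collections import defaultdict


def backproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a depth map (list of rows, metres, 0 = invalid) to world-frame points.

    cam_to_world is a 3x4 row-major pose [R | t] (assumed convention).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels without a valid depth reading
                continue
            # Pinhole model: pixel (u, v) at depth z -> camera-frame point.
            pc = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Rigid transform into the shared world frame.
            points.append(tuple(
                sum(r[i] * pc[i] for i in range(3)) + r[3] for r in cam_to_world
            ))
    return points


def sparse_voxelize(points, voxel_size):
    """Keep occupied voxel indices only; empty space is never stored."""
    voxels = defaultdict(int)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key] += 1  # per-voxel point count, a crude density cue
    return dict(voxels)


# Toy example: two views of the same 2x2 depth map, the second camera
# shifted by 0.5 m along x (hypothetical values).
view_a = [(1, 0, 0, 0.0), (0, 1, 0, 0.0), (0, 0, 1, 0.0)]
view_b = [(1, 0, 0, 0.5), (0, 1, 0, 0.0), (0, 0, 1, 0.0)]
depth = [[1.0, 0.0], [1.0, 1.0]]

cloud = (backproject(depth, 1.0, 1.0, 0.0, 0.0, view_a)
         + backproject(depth, 1.0, 1.0, 0.0, 0.0, view_b))
grid = sparse_voxelize(cloud, voxel_size=0.5)
```

Because only occupied voxels are stored, memory grows with the number of observed surface points rather than with the volume of the bin, which is what makes fine voxel sizes affordable in a sketch like this one.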
