Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion
Yu Zhu, Naoya Chiba, Koichi Hashimoto
TL;DR
This work tackles 3D segmentation in cluttered industrial environments by introducing a two-stage, image-guided hierarchy that first isolates object instances using top-view 2D masks and then refines each instance into fine-grained parts via multi-view back-projection. It uses SAM with YOLO-World prompts in a scale-aware rendering framework, and applies Bayesian updating fusion to enforce cross-view consistency and robustness under occlusion. The approach demonstrates improved part-level accuracy on real factory data and generalizes to public datasets such as PartNet, while remaining annotation-efficient thanks to its modular 2D supervision. The results indicate strong potential for practical deployment in complex 3D manufacturing scenes; future work aims to incorporate depth information and tighter multi-view consistency at the 2D segmentation stage.
Abstract
Reliable 3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects, as commonly seen in industrial environments. In such scenarios, heavy occlusion weakens geometric boundaries between objects, and large differences in object scale cause end-to-end models to fail to capture both coarse and fine details accurately. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. To address these challenges, we propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from the instance level to the part level. Instance segmentation is performed by rendering a top-view image and projecting SAM-generated masks, prompted by YOLO-World, back onto the 3D point cloud. Part-level segmentation is then performed by rendering multi-view images of each instance obtained in the previous stage and applying the same 2D segmentation and back-projection process to each view, followed by Bayesian updating fusion to ensure semantic consistency across views. Experiments on real-world factory data demonstrate that our method effectively handles occlusion and structural complexity, achieving consistently high per-class mIoU scores. Additional evaluations on a public dataset confirm the generalization ability of our framework, highlighting its robustness, annotation efficiency, and adaptability to diverse 3D environments.
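To illustrate the idea of Bayesian updating fusion across views, the following is a minimal sketch and not the exact formulation used in this work. It assumes that each rendered view yields, for every 3D point, a probability vector over part labels (for example, derived from back-projected 2D masks), that a visibility mask marks which points each view actually observes, and that views are conditionally independent. The function name `bayesian_fuse` and the array layouts are hypothetical.

```python
import numpy as np

def bayesian_fuse(view_probs, visible, eps=1e-6):
    """Fuse per-view label probabilities into one posterior per 3D point.

    view_probs: (V, N, K) array; label likelihoods from each of V views
                for N points and K labels (assumed input format).
    visible:    (V, N) boolean array; True where a point is seen in a view.
    Returns an (N, K) array of posterior label probabilities.
    """
    view_probs = np.asarray(view_probs, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    num_views, num_points, num_labels = view_probs.shape

    # Start from a uniform prior over labels, kept in log space for stability.
    log_post = np.full((num_points, num_labels), -np.log(num_labels))

    for v in range(num_views):
        # Clip to avoid log(0) when a view assigns zero probability to a label.
        log_lik = np.log(np.clip(view_probs[v], eps, 1.0))
        # Bayesian update: posterior is proportional to prior times likelihood.
        # Points occluded in this view keep their current belief.
        log_post[visible[v]] += log_lik[visible[v]]

    # Normalize each point's distribution so it sums to one.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return post

# Toy example: 3 views, 2 points, 2 part labels.
view_probs = np.array([
    [[0.8, 0.2], [0.9, 0.1]],
    [[0.7, 0.3], [0.9, 0.1]],
    [[0.4, 0.6], [0.1, 0.9]],
])
# Point 1 is occluded in the first two views, so only view 2 informs it.
visible = np.array([
    [True, False],
    [True, False],
    [True, True],
])
posterior = bayesian_fuse(view_probs, visible)
labels = posterior.argmax(axis=1)
```

In this toy case, the single dissenting view for point 0 is outvoted by the two agreeing views, while point 1 takes its label from the only view in which it is visible. This is the kind of cross-view consistency and occlusion handling the fusion step is meant to provide.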
