Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion

Yu Zhu, Naoya Chiba, Koichi Hashimoto

TL;DR

This work tackles 3D segmentation in cluttered industrial environments by introducing a two-stage, image-guided hierarchy that first isolates object instances from top-view 2D masks and then refines to fine-grained parts via multi-view back-projections. It leverages SAM and YOLO-World prompts in a scale-aware rendering framework, with Bayesian updating fusion to enforce cross-view consistency and robustness under occlusion. The approach demonstrates improved part-level accuracy on real factory data and shows generalization to public datasets like PartNet, while offering annotation-efficient advantages through modular 2D supervision. The results highlight strong potential for practical deployment in complex 3D manufacturing scenes, with future work aimed at incorporating depth information and tighter multi-view consistency at the 2D segmentation stage.

Abstract

Reliable 3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects, as commonly seen in industrial environments. In such scenarios, heavy occlusion weakens geometric boundaries between objects, and large differences in object scale cause end-to-end models to fail to capture both coarse and fine details accurately. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. To address these challenges, we propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level. Instance segmentation involves rendering a top-view image and projecting SAM-generated masks prompted by YOLO-World back onto the 3D point cloud. Part-level segmentation is subsequently performed by rendering multi-view images of each instance obtained from the previous stage and applying the same 2D segmentation and back-projection process at each view, followed by Bayesian updating fusion to ensure semantic consistency across views. Experiments on real-world factory data demonstrate that our method effectively handles occlusion and structural complexity, achieving consistently high per-class mIoU scores. Additional evaluations on a public dataset confirm the generalization ability of our framework, highlighting its robustness, annotation efficiency, and adaptability to diverse 3D environments.
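The Bayesian updating fusion step mentioned in the abstract can be illustrated with a minimal sketch. The exact likelihood model used in the paper is not given in this summary, so the code below assumes a simple symmetric noise model: each per-view 2D label is correct with probability `p_correct` (a hypothetical parameter), errors are spread evenly over the other classes, and views are treated as independent. The function names `bayesian_label_fusion` and `map_labels` are illustrative, not taken from the paper.

```python
from typing import Dict, List, Sequence


def bayesian_label_fusion(
    num_points: int,
    num_classes: int,
    views: Sequence[Dict[int, int]],
    p_correct: float = 0.8,
) -> List[List[float]]:
    """Fuse per-view labels into a per-point posterior over classes.

    views[v] maps point index -> class label back-projected from view v.
    Points occluded in a view are simply absent from that mapping.
    p_correct is an ASSUMED probability that a single 2D label is right.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 < p_correct < 1.0:
        raise ValueError("p_correct must be strictly between 0 and 1")

    # Assumed symmetric noise: wrong labels are equally likely.
    p_wrong = (1.0 - p_correct) / (num_classes - 1)

    # Start every point from a uniform prior over the classes.
    posterior = [[1.0 / num_classes] * num_classes for _ in range(num_points)]

    for observed in views:
        for idx, label in observed.items():
            belief = posterior[idx]
            # Bayes' rule: posterior is proportional to prior * likelihood.
            unnorm = [
                belief[k] * (p_correct if k == label else p_wrong)
                for k in range(num_classes)
            ]
            total = sum(unnorm)
            posterior[idx] = [u / total for u in unnorm]
    return posterior


def map_labels(posterior: List[List[float]]) -> List[int]:
    """Pick the most probable class for each point (ties go to the lowest index)."""
    return [max(range(len(p)), key=p.__getitem__) for p in posterior]


if __name__ == "__main__":
    # Three views of three points; point 2 is never visible.
    views = [{0: 1, 1: 0}, {0: 1, 1: 2}, {0: 0, 1: 2}]
    post = bayesian_label_fusion(num_points=3, num_classes=3, views=views)
    print(map_labels(post))
```

In this toy input, point 0 is labeled class 1 in two views and class 0 in one, so the fused posterior favors class 1 despite the conflicting view. A point that is never observed keeps its uniform prior, which is why a real pipeline would likely need a separate rule for such points.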

Paper Structure

This paper contains 12 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (Left) Entire manufacturing factory. (Right) Close-up view of RA equipment.
  • Figure 2: The pipeline of the proposed hierarchical 3D segmentation system. The framework consists of two main stages: (1) coarse instance-level segmentation; (2) fine-grained part-level segmentation.
  • Figure 3: 2D segmentation comparison. From left to right: (a) Input image (b) Instance segmentation (c) Part-level segmentation (d) Single-stage segmentation.
  • Figure 4: Fine-grained part segmentation results. From left to right: (a) Rendered image (b) YOLO-World Detection (c) YOLO-World+SAM segmentation (d) GT.
  • Figure 5: 3D point cloud segmentation results. From left to right: (a) Projection-only (b) With cluster method (c) With Bayes (d) GT
  • ...and 3 more figures