GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision

Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang

TL;DR

GrabS tackles unsupervised 3D object segmentation with a two-stage framework. The first stage learns generative, object-centric priors from single-object data, using a VAE-based or diffusion-based model that aligns objects to a canonical pose. The second stage deploys an embodied RL agent that discovers multiple objects in scenes by querying these priors. The multi-object estimation network combines an embodied discovery branch with a segmentation branch trained on pseudo labels, enabling efficient inference once objects are found. Across ScanNet, S3DIS, and a synthetic 6-class dataset, GrabS consistently outperforms strong unsupervised baselines and approaches the performance of some fully supervised methods, demonstrating robust object priors and effective exploration. This approach reduces the reliance on scene-level annotations and has practical implications for autonomous systems and robotics requiring scalable 3D scene understanding.

Abstract

We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. Existing unsupervised methods group 3D points into objects by relying on the similarity of pretrained 2D features or on external signals such as motion. As a result, they are usually limited to identifying simple objects like cars, or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
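
To make the idea of "querying against the pretrained generative priors" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a spherical container, a Chamfer-distance match score, a binary reward with a fixed threshold, and a toy prior that only knows unit spheres; the names `crop_container`, `chamfer_distance`, `prior_reward`, and `unit_sphere_prior` are all hypothetical and do not come from the paper.

```python
import math

def crop_container(points, center, radius):
    """Return the points inside a spherical container (the shape is an assumption)."""
    return [p for p in points if math.dist(p, center) <= radius]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def prior_reward(points, center, radius, reconstruct, threshold=0.05):
    """Hypothetical reward: 1.0 if the prior explains the cropped points, else 0.0."""
    cropped = crop_container(points, center, radius)
    if not cropped:
        return 0.0  # an empty container cannot hold an object
    recon = reconstruct(cropped)  # stand-in for a pretrained generative prior
    return 1.0 if chamfer_distance(cropped, recon) < threshold else 0.0

def unit_sphere_prior(cropped):
    """Toy prior that only knows unit spheres at the origin: project points onto one."""
    out = []
    for p in cropped:
        norm = math.sqrt(sum(c * c for c in p)) or 1.0  # avoid division by zero
        out.append(tuple(c / norm for c in p))
    return out

# Points that lie on the unit sphere are explained well by the toy prior.
sphere = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
# Points far from the sphere surface are not.
clutter = [(0.3, 0, 0), (0, 0.3, 0), (0, 0, 0.3)]

reward_object = prior_reward(sphere, (0, 0, 0), 1.5, unit_sphere_prior)
reward_clutter = prior_reward(clutter, (0, 0, 0), 1.5, unit_sphere_prior)
reward_empty = prior_reward(sphere, (10, 10, 10), 0.1, unit_sphere_prior)
```

In this toy setup an agent that moves the container would receive a positive reward only where the cropped points resemble something the prior can reproduce. The reward actually used by GrabS may differ in every one of these choices.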

Paper Structure

This paper contains 28 sections, 2 equations, 14 figures, 19 tables.

Figures (14)

  • Figure 1: An illustration of the overall framework.
  • Figure 2: Object orientation estimation module.
  • Figure 3: The object generative prior learning module with VAE-based and diffusion-based variants.
  • Figure 4: The framework of multi-object estimation network.
  • Figure 5: The steps to generate rewards for the container from our pretrained object-centric network.
  • ...and 9 more figures