Find Any Part in 3D

Ziqi Ma, Yisong Yue, Georgia Gkioxari

TL;DR

Find3D tackles data scarcity in 3D part segmentation by building a scalable data engine that leverages 2D foundation models to annotate 3D assets, producing a dataset with $2.1$ million part annotations across 761 object categories and 124,615 unique part types. A transformer-based 3D part model is trained with a simple contrastive objective to map per-point features into a CLIP-like embedding space, enabling open-world, text-driven segmentation for any object part. The approach yields a $260\%$ improvement in mIoU and speeds inference by $6\times$ to $300\times$ over existing open-world methods, and generalizes to unseen objects without dataset-specific finetuning. The authors release a new open-world 3D part benchmark and demonstrate strong scaling effects, suggesting data scale is the key driver of generalization in 3D segmentation.

Abstract

Why don't we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1755x more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. Even when evaluated zero-shot, we outperform existing methods on the datasets they train on. We achieve 260% improvement in mIoU and boost speed by 6x to 300x. Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general-category open-world 3D part segmentation, we release a benchmark covering a wide range of objects and parts. Project website: https://ziqi-ma.github.io/find3dsite/
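
To make "any text query" concrete, here is a minimal sketch of how per-point features living in a text embedding space could be turned into a segmentation. It assumes each point is simply assigned to the query with the highest cosine similarity; that assignment rule, the function names `normalize` and `segment_by_text`, and the toy 2-D vectors are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def segment_by_text(point_features, query_embeddings):
    """Label each point with the index of its most similar text query.

    Assumption: similarity is cosine similarity and the best-scoring
    query wins; a real system might also allow a "no part" label.
    """
    queries = [normalize(q) for q in query_embeddings]
    labels = []
    for feat in point_features:
        f = normalize(feat)
        sims = [sum(a * b for a, b in zip(f, q)) for q in queries]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy 2-D vectors standing in for real model outputs and text embeddings.
point_features = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
query_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical queries, e.g. "handle", "blade"
labels = segment_by_text(point_features, query_embeddings)
```

Because the queries are only embedded at inference time, the set of part names does not have to be fixed in advance, which is what makes the setting open-world.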

Paper Structure

This paper contains 20 sections, 1 equation, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Find3D is the first general-category 3D model that can segment any part of any object with any text query. We achieve this by building a scalable Data Engine powered by 2D foundation models -- SAM & Gemini -- that automatically annotates 3D assets from the web. Using the labeled data, Find3D trains a transformer-based point cloud model with a contrastive training recipe. Our method works on diverse 3D objects and parts, e.g. the easel, the imaginary animal, and the ceiling fan.
  • Figure 2: The Data Engine. (a) We render Objaverse assets into multiple views and pass each rendering to SAM with gridpoint prompts for segmentation. For each mask, we query Gemini for the corresponding part name, which gives us (mask, text) pairs. We embed the part name into the latent embedding space of a vision and language foundation model such as SigLIP. We back-project mask pixels to obtain the points associated with each label embedding, yielding (points, text embedding) pairs. (b) Example annotations by the data engine.
  • Figure 3: Find3D: an open-world part segmentation model. Find3D takes in a point cloud, voxelizes and serializes the points via space-filling curves into a sequence. The sequence is passed through a transformer architecture which returns a pointwise feature that is in the embedding space of a vision and language foundation model, denoted by $\mathbb{T}$. These features can be queried with any free-form text. Find3D is trained with a contrastive objective. For each (points, text embedding) label from the data engine, we use the averaged feature of these points as the predicted embedding, and pair it with the text embedding to form a positive pair in the contrastive loss.
  • Figure 4: Our benchmark. (a) Examples of Objaverse-General and ShapeNetPart-V2. Objaverse-General contains diverse objects and parts, and ShapeNetPart-V2 is sourced to look similar to ShapeNet-Part to test various methods' generalization capability. (b) Object category breakdown of Objaverse-General, which covers 9 categories from tools to buildings. (c) Comparison with existing benchmarks. We have $\mathbf{5 \times}$ more unique part types, $\mathbf{4.4 \times}$ more total annotated parts, and $\mathbf{2.9\times}$ more object categories.
  • Figure 5: Qualitative results. Left: Find3D performs strongly on Objaverse-General while baseline methods struggle. Right: more examples from both Objaverse-General and PartObjaverse-Tiny, including out-of-distribution objects such as magical animals and complex anime-style characters. Find3D works on diverse object categories with up to 9 parts. It also generalizes to "in-the-wild" iPhone photos (converted to point clouds via an off-the-shelf image-to-3D method), as shown at bottom right.
  • ...and 10 more figures
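
The contrastive training recipe described in the Figure 3 caption can be sketched as follows: the features of the points belonging to one label are averaged into a predicted embedding, which is then paired with that label's text embedding as a positive. The caption does not spell out the loss, so the InfoNCE-style form, the `temperature` value, and the helper names `mean_pool` and `contrastive_loss` are assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(point_features, indices):
    """Average the features of the points that belong to one labeled part."""
    dim = len(point_features[0])
    return [
        sum(point_features[i][d] for i in indices) / len(indices)
        for d in range(dim)
    ]


def contrastive_loss(pred_embeddings, text_embeddings, temperature=0.1):
    """InfoNCE-style loss (assumed form): prediction i should match text i.

    Every other text embedding in the batch acts as a negative.
    """
    preds = [normalize(p) for p in pred_embeddings]
    texts = [normalize(t) for t in text_embeddings]
    total = 0.0
    for i, p in enumerate(preds):
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(preds)


# Toy data: four points, two labeled parts, 2-D stand-in embeddings.
point_features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
parts = [[0, 1], [2, 3]]                  # point indices per (points, text) label
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]

preds = [mean_pool(point_features, idx) for idx in parts]
aligned_loss = contrastive_loss(preds, text_embeddings)
swapped_loss = contrastive_loss(preds, text_embeddings[::-1])
```

In this toy setup the loss is lower when each pooled feature is paired with its own text embedding than when the pairing is swapped, which is the behavior a contrastive objective is meant to reward.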