3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

Anh Thai, Weiyao Wang, Hao Tang, Stefan Stojanov, Matt Feiszli, James M. Rehg

TL;DR

3-By-2 addresses 3D object part segmentation under limited 3D annotations by transferring labels from richly annotated 2D datasets through diffusion-model-based semantic correspondences across multi-view renders, without training. The method renders the object from multiple views, segments each view in 2D by transferring part labels from an annotated 2D database, and aggregates the per-view predictions with a novel mask-consistency module before back-projecting them to 3D, all in a language-free, training-free framework. It introduces non-overlapping 2D mask generation and mask-level label transfer to preserve boundary precision across granularities, and it demonstrates strong cross-category transfer with state-of-the-art performance in zero-shot and few-shot settings on PartNetE and PartNet. The work provides extensive ablations, analyzes database composition, and includes qualitative results on real and synthetic objects.

Abstract

3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2, which achieves SOTA performance on different benchmarks with various granularity levels. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we are able to overcome the challenges of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. 3-By-2 can accommodate various part taxonomies and granularities, and it demonstrates part label transfer across different object categories. Project website: https://ngailapdi.github.io/projects/3by2/

Paper Structure (30 sections, 3 equations, 16 figures, 12 tables)

Figures (16)

  • Figure 1: We propose 3-By-2, a novel training-free method for low-shot 3D object part segmentation that achieves SOTA performance on both zero-shot and few-shot settings.
  • Figure 2: Overview of our proposed method 3-By-2. (1) Render the input object from multiple camera viewpoints, (2) Perform 2D part segmentation on each view individually by leveraging 2D semantic correspondences and a class-agnostic 2D segmentation model, (3) Aggregate the 2D predictions from multiple views using our proposed mask-consistency module, (4) Back-project the predictions to 3D using depth information.
  • Figure 3: The process of pixel-level part label transfer. For each pixel $p$ in the query image $I_k$, we perform the following: (1) Extract the feature $f(p)$, along with the feature grid for each image $I_\mathcal{D}$ in the database $\mathcal{D}$, (2) Measure the cosine similarity between $f(p)$ and the feature of each pixel within each feature grid, (3) Obtain the best match of $p$ over $\mathcal{D}$ by determining the most similar pixel $p_\mathcal{D}$ over all images $I_\mathcal{D}$, (4) Assign $p$ the label of $p_\mathcal{D}$.
  • Figure 4: Non-overlapping 2D mask proposal and mask sampling strategies. Non-overlapping 2D Mask Proposal: We address the issue of overlapping masks produced by SAM. The masks are first sorted by their areas. Subsequently, the smaller masks are stacked on top of the larger ones. Non-overlapping masks are obtained by taking the visible segment of each mask. Different mask sampling strategies for label transfer: Our strategy provides accurate, dense prediction with clear part boundaries.
  • Figure 5: Two approaches to aggregate 3D part labels from multiple 2D views. Aggregating 3D part labels from multiple 2D views through geometric correspondence can be achieved by either point or mask label consistency.
  • ...and 11 more figures
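
The pixel-level label transfer described in the Figure 3 caption can be sketched in a few lines. This is only an illustration of the nearest-neighbor matching step, not the authors' implementation: the names `cosine` and `transfer_pixel_label`, the nested-list data layout, and the toy features are assumptions made for the example, and real features would come from a pretrained model rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def transfer_pixel_label(query_feat, database):
    """Return (label, similarity) of the best-matching database pixel.

    `database` is a list of (feature_grid, label_grid) pairs, one per annotated
    image; each grid is a list of rows (hypothetical layout for this sketch).
    """
    best_sim, best_label = -math.inf, None
    for feat_grid, label_grid in database:
        for feat_row, label_row in zip(feat_grid, label_grid):
            for feat, label in zip(feat_row, label_row):
                sim = cosine(query_feat, feat)
                if sim > best_sim:  # keep the most similar pixel over all images
                    best_sim, best_label = sim, label
    return best_label, best_sim


# Toy database: two 1x2 "images" with 2-D features and part labels.
toy_db = [
    ([[[1.0, 0.0], [0.0, 1.0]]], [["leg", "seat"]]),
    ([[[1.0, 1.0], [-1.0, 0.0]]], [["back", "arm"]]),
]
label, sim = transfer_pixel_label([0.9, 0.1], toy_db)
```

An exhaustive search like this is quadratic in the number of pixels; it is written this way for readability, and a practical version would batch the similarity computation as a matrix product.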
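
The non-overlapping mask proposal in the Figure 4 caption (sort by area, stack smaller masks on top of larger ones, keep the visible segment) can be illustrated as follows. This is a minimal sketch under assumptions: masks are represented as sets of `(row, col)` pixels rather than the boolean arrays a segmentation model would return, the function name `non_overlapping_masks` is hypothetical, and ties in area are broken by input order.

```python
def non_overlapping_masks(masks):
    """Resolve overlaps so that smaller masks sit on top of larger ones.

    `masks` is a list of sets of (row, col) pixels. The result has the same
    order as the input; each entry is the visible part of that mask.
    """
    # Visit masks from smallest to largest area; a smaller mask is "on top",
    # so it claims its pixels before any larger mask can.
    order = sorted(range(len(masks)), key=lambda i: len(masks[i]))
    claimed = set()
    visible = [set() for _ in masks]
    for i in order:
        visible[i] = masks[i] - claimed  # pixels not hidden by a smaller mask
        claimed |= masks[i]
    return visible


# Toy example: a 4x4 "object" mask with two overlapping part masks inside it.
big = {(r, c) for r in range(4) for c in range(4)}
mid = {(0, 1), (1, 1), (2, 1)}
small = {(0, 0), (0, 1)}
out = non_overlapping_masks([big, mid, small])
```

After this step every pixel belongs to at most one mask, which is what allows a single part label to be assigned per mask without conflicts along boundaries.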
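
Step (4) of the Figure 2 caption back-projects per-view 2D predictions to 3D using depth. The caption does not specify the camera model, so the sketch below assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and returns points in camera coordinates; the function name `back_project` and the convention that a depth of zero or `None` marks background are also assumptions for illustration.

```python
def back_project(depth, labels, fx, fy, cx, cy):
    """Lift labeled pixels to 3D camera coordinates with a pinhole model.

    `depth` and `labels` are same-shaped lists of rows. Pixels whose depth is
    None or non-positive are treated as background and skipped.
    Returns a list of ((X, Y, Z), label) tuples.
    """
    points = []
    for v, (depth_row, label_row) in enumerate(zip(depth, labels)):
        for u, (z, label) in enumerate(zip(depth_row, label_row)):
            if z is None or z <= 0:
                continue
            x = (u - cx) * z / fx  # horizontal offset from the principal point
            y = (v - cy) * z / fy  # vertical offset from the principal point
            points.append(((x, y, z), label))
    return points


# Toy 2x2 view: two foreground pixels with predicted part labels.
toy_depth = [[0.0, 2.0], [1.0, 0.0]]
toy_labels = [["bg", "lid"], ["handle", "bg"]]
cloud = back_project(toy_depth, toy_labels, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

To merge several views into one labeled point cloud, each view's points would additionally be transformed by that view's camera-to-world pose, which is omitted here.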