3D Part Segmentation via Geometric Aggregation of 2D Visual Features

Marco Garosi, Riccardo Tedoldi, Davide Boscaini, Massimiliano Mancini, Nicu Sebe, Fabio Poiesi

TL;DR

This work tackles open-set 3D part segmentation by decoupling part decomposition from semantic labeling and fusing 2D vision foundation model features with 3D geometry. The proposed COPS pipeline renders multiple views, extracts dense 2D features with a frozen model (e.g., DINOv2), back-projects them to 3D, and refines the resulting features through a Geometric Feature Aggregation module that enforces spatial and semantic coherence. A zero-shot head clusters points into parts and uses CLIP-based semantic anchors to assign language-driven labels, with a Hungarian matching step to align clusters to semantic categories. Across five benchmarks, COPS achieves zero-shot state-of-the-art results, demonstrating strong transferability to both synthetic and real-world data, including textureless and colored objects, and rigid and non-rigid shapes.
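
The back-projection step above can be illustrated with a minimal sketch. It assumes the renderer already provides, for each view, the pixel that every 3D point projects to and a visibility flag; the function name `lift_features`, the array layouts, and the plain averaging across views are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lift_features(feature_maps, pixel_coords, visible):
    """Average per-view 2D features onto 3D points (illustrative sketch).

    feature_maps: (V, H, W, C) dense features, one map per rendered view.
    pixel_coords: (V, N, 2) integer (row, col) of each point in each view.
    visible:      (V, N) boolean, True where the point is seen in that view.
    Returns (N, C) point features; points never seen keep a zero vector.
    """
    V, H, W, C = feature_maps.shape
    N = pixel_coords.shape[1]
    summed = np.zeros((N, C), dtype=np.float64)
    counts = np.zeros(N, dtype=np.int64)
    for v in range(V):
        idx = np.flatnonzero(visible[v])          # points seen in view v
        rows = pixel_coords[v, idx, 0]
        cols = pixel_coords[v, idx, 1]
        summed[idx] += feature_maps[v, rows, cols]  # gather pixel features
        counts[idx] += 1
    seen = counts > 0
    summed[seen] /= counts[seen, None]            # mean over the views that saw the point
    return summed
```

Averaging is only one possible way to merge views; a point that is occluded in every view receives no signal here and would need a fallback such as copying from its nearest visible neighbour.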

Abstract

Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. However, naively applying VLMs in this context introduces several drawbacks, such as the need for meticulous prompt engineering, and fails to leverage the 3D geometric structure of objects. To address these limitations, we propose COPS, a COmprehensive model for Parts Segmentation that blends the semantics extracted from visual concepts and 3D geometry to effectively identify object parts. COPS renders a point cloud from multiple viewpoints, extracts 2D features, projects them back to 3D, and uses a novel geometric-aware feature aggregation procedure to ensure spatial and semantic consistency. Finally, it clusters points into parts and labels them. We demonstrate that COPS is efficient, scalable, and achieves zero-shot state-of-the-art performance across five datasets, covering synthetic and real-world data, texture-less and coloured objects, as well as rigid and non-rigid shapes. The code is available at https://3d-cops.github.io.
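
The final "cluster, then label" step can be sketched as follows. This is a hedged illustration, not the released code: it assumes the clusters are already computed, that each point has an embedding living in the same space as the text embeddings of the part names, and that there are at least as many part names as clusters. The function name `label_clusters` and all argument names are hypothetical. The one-to-one assignment uses SciPy's `linear_sum_assignment`, which solves the same assignment problem as Hungarian matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def label_clusters(point_emb, cluster_ids, text_anchors):
    """Map each cluster id to the index of one part name (illustrative sketch).

    point_emb:    (N, D) per-point embeddings (assumed comparable to the text space).
    cluster_ids:  (N,) integer cluster assignment of each point.
    text_anchors: (P, D) text embeddings of the candidate part names.
    Returns a dict {cluster id: part-name index}.
    """
    clusters = np.unique(cluster_ids)
    # One prototype per cluster: the mean embedding of its points.
    protos = np.stack([point_emb[cluster_ids == c].mean(axis=0) for c in clusters])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    anchors = text_anchors / np.linalg.norm(text_anchors, axis=1, keepdims=True)
    sim = protos @ anchors.T                      # cosine similarity, (K, P)
    # linear_sum_assignment minimises cost, so negate to maximise similarity.
    rows, cols = linear_sum_assignment(-sim)
    return {int(clusters[r]): int(c) for r, c in zip(rows, cols)}
```

Because the matching is one-to-one, two clusters can never receive the same part name, which is what separates this step from a per-point argmax over the text prompts.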

Paper Structure

This paper contains 13 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The quality of part descriptions significantly affects the segmentation performance of methods based on vision-language models. For example, the performance of PointCLIPv2 [zhu_pointclip_2023] (left) deteriorates rapidly when replacing the default textual prompt with a GPT-generated description, the template "This is a depth image of an airplane's [part]", or simply using part names. In contrast, our pipeline (center) achieves more accurate segmentations by disentangling part decomposition from part classification. The improvement is evident when using the same CLIP visual features as PointCLIPv2 (top) and becomes even more pronounced when using DINOv2 [oquab2023dinov2] features (bottom), the default choice of COPS. COPS generates more uniform segments with sharper boundaries, resulting in higher segmentation quality.
  • Figure 2: Overview of the COPS feature extractor. $\Phi$ (top) extracts point-level features by (i) rendering multiple views of the object, (ii) processing them with DINOv2, and (iii) lifting them to 3D. The Geometric Feature Aggregation module (GFA, bottom) further refines these features by extracting super points (blue points in the second row) and their neighbouring points (red points in the second row) to obtain spatially consistent centroids. These centroids are used to perform spatially and semantically consistent feature aggregation, ensuring that the features are both locally consistent and similar across large distances when describing the same part (e.g., the armrest).
  • Figure 3: Qualitative results on ShapeNetPart [yi2016shapenetpart]. Top to bottom: PointCLIPv2 [zhu_pointclip_2023], COPS, ground-truth. These results show that PointCLIPv2 often struggles in describing and segmenting some parts, such as the wheels of the skateboard or the wings of the plane. COPS instead produces a better segmentation, with more uniform part segments and sharper part boundaries.
  • Figure 4: Qualitative results on ScanObjectNN [uy2019scanobjectnn]. Top to bottom: input point cloud with color information; PointCLIPv2's prediction; COPS's prediction; ground-truth segmentation. COPS outputs better and sharper segmentations than PointCLIPv2.
  • Figure 5: Ablation on ShapeNetPart [chang2015shapenet]. From left to right: (a) Different prompt types, comparing PointCLIPv2 [zhu_pointclip_2023] and COPS. (b) Varying the number of views during rendering, with and without our GFA module. (c) Changing the foundation model. (d) Ablating the GFA module.
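
The aggregation idea described in the Figure 2 caption (super points, their neighbours, and centroids that drive a spatially and semantically consistent smoothing) can be approximated with the sketch below. Every concrete choice here is an assumption made for illustration: farthest point sampling to pick super points, k-nearest neighbours to form their neighbourhoods, and a softmax weighting that mixes cosine similarity with Euclidean distance under a temperature `tau`. The paper's actual GFA module may differ in all of these.

```python
import numpy as np

def farthest_point_sampling(xyz, m):
    """Pick m well-spread point indices, starting from index 0."""
    chosen = [0]
    dist = np.linalg.norm(xyz - xyz[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())                  # farthest from all chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

def aggregate_features(xyz, feats, m=4, k=8, tau=0.1):
    """Smooth point features through super-point centroids (illustrative only)."""
    sp = farthest_point_sampling(xyz, m)                                  # (m,)
    d = np.linalg.norm(xyz[sp][:, None, :] - xyz[None, :, :], axis=2)     # (m, N)
    nn = np.argsort(d, axis=1)[:, :k]                                     # k neighbours per super point
    cen_xyz = xyz[nn].mean(axis=1)                                        # (m, 3) centroid positions
    cen_feat = feats[nn].mean(axis=1)                                     # (m, C) centroid features
    # Affinity of every point to every centroid: feature similarity minus distance.
    d_pc = np.linalg.norm(xyz[:, None, :] - cen_xyz[None, :, :], axis=2)  # (N, m)
    unit = lambda a: a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    cos = unit(feats) @ unit(cen_feat).T                                  # (N, m)
    logits = (cos - d_pc) / tau
    logits -= logits.max(axis=1, keepdims=True)                           # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ cen_feat                                                   # (N, C) refined features
```

Mixing a cosine term with a raw distance term only makes sense when the point cloud is scaled to a known range, so in practice the coordinates would be normalised first or the two terms would get separate temperatures.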