Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding

Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi

TL;DR

GeoZe tackles zero-shot 3D point cloud understanding by transferring 2D Vision-Language Model features to 3D points without learning. It introduces a geometrically-driven, training-free aggregation that jointly leverages the point cloud geometry and VLM features for each point via local-to-global steps over superpoints and an anchor-based refinement to preserve language alignment. The approach yields state-of-the-art results across shape classification, part segmentation, and semantic segmentation on diverse synthetic and real datasets, including indoor and outdoor scenes, while maintaining a lightweight, training-free paradigm. GeoZe thus provides a practical, geometry-aware bridge between 2D VLMs and 3D point clouds with modest additional computation.

Abstract

Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map VLM representations from 2D pixels of rendered or captured views to 3D points, overlooking the inherent and expressible geometric structure of the point cloud. Geometrically similar or close regions can be exploited to bolster point cloud understanding, as they are likely to share semantic information. To this end, we introduce the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred VLM representations. Our approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks (classification, part segmentation, and semantic segmentation) with a variety of datasets covering synthetic and real-world data as well as indoor and outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks. Code and dataset are available at https://luigiriz.github.io/geoze-website/

Paper Structure (25 sections, 6 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Given a set of dense per-pixel VLM representations (e.g., CLIP [radford2021learning]) extracted from different viewpoint images of a point cloud, our approach is the first geometrically-driven aggregation technique to effectively transfer these image representations to 3D points. We use geometric features (e.g., FPFH [rusu2009fast], normals) extracted from the point cloud to denoise the VLM representations through an iterative process. This process begins by aggregating information locally and then extends to operate globally, thereby facilitating improvements across a variety of downstream tasks.
  • Figure 2: Overview of the GeoZe framework. GeoZe first clusters the point cloud $\bm{\mathcal{P}}$ into superpoints $\bar{\bm{\mathcal{P}}}$ along with their associated geometric representation $\bar{\bm{\mathcal{G}}}$, VLM representation $\bar{\bm{\mathcal{F}}}$, and anchors $\bm{\mathcal{C}}$. For each superpoint $\bar{\bm{p}}_j$, we identify its $k$NN within the point cloud to form a patch $\bm{\mathcal{P}}^j$ with features $\bm{\mathcal{G}}^j$ and $\bm{\mathcal{F}}^j$. For each patch, we perform a local feature aggregation to refine the VLM representations $\bm{\mathcal{F}}$. The superpoints then undergo a global aggregation, after which a global-to-local aggregation updates the per-point features. Lastly, we employ the VLM feature anchors to further refine the per-point features, which are then ready to be used for downstream tasks.
  • Figure 3: t-SNE embeddings of (a) PointCLIPv2 [zhu2023pointclip] and (b) GeoZe on ModelNet40. GeoZe produces better separated and more compact clusters for different categories, as evidenced by the higher silhouette coefficient (SC) and greater inter-cluster distance (inter), alongside a smaller intra-cluster distance (intra).
  • Figure 4: Zero-shot part segmentation results on ShapeNetPart [yi2016scalable]. (top row) ground-truth annotations, (middle row) PointCLIPv2 [zhu2023pointclip], and (bottom row) GeoZe. Parts segmented by GeoZe are more homogeneous than those segmented by PointCLIPv2.
  • Figure 5: Zero-shot semantic segmentation results on ScanNet [dai2017scannet] using OpenSeg feature extraction (Tab. \ref{tab:indoor_sem_seg}). (top row) ground-truth annotations, (middle row) OpenScene (OpenSeg), and (bottom row) GeoZe.
  • ...and 6 more figures
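
The local aggregation step described in the Figure 2 caption (form a $k$NN patch around a superpoint, then refine the VLM features using geometry) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax weighting over geometric cosine similarity, and the `temperature` parameter are all assumptions chosen for illustration, and surface normals stand in for the geometric features.

```python
import math

def knn_indices(points, query, k):
    """Indices of the k points closest to `query` (brute-force Euclidean search)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return order[:k]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def aggregate_patch(points, geo_feats, vlm_feats, superpoint, sp_geo,
                    k=3, temperature=0.1):
    """Geometry-weighted average of VLM features over one superpoint's kNN patch.

    ASSUMPTION: weights are a softmax over geometric cosine similarity.
    The paper's actual weighting scheme may differ.
    """
    patch = knn_indices(points, superpoint, k)
    logits = [cosine(geo_feats[i], sp_geo) / temperature for i in patch]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vlm_feats[0])
    agg = [sum(w * vlm_feats[i][d] for w, i in zip(weights, patch))
           for d in range(dim)]
    return patch, weights, agg

# Toy data (hypothetical): three nearby points on a flat surface, one far away.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
vlm = [(1.0, 0.0), (0.8, 0.2), (0.0, 1.0), (0.0, 1.0)]  # third feature is "noisy"

patch, weights, agg = aggregate_patch(
    points, normals, vlm, superpoint=(0.0, 0.0, 0.0), sp_geo=(0.0, 0.0, 1.0))
```

In this toy case the three nearby points share the same normal, so they receive equal weight and the noisy third feature is pulled toward its neighbours. A lower `temperature` would concentrate the weight on the points whose geometry best matches the superpoint.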