Representation Learning for Point Cloud Understanding

Siming Yan

TL;DR

This work surveys and advances representation learning for 3D point clouds by integrating supervised primitive segmentation, self-supervised learning, and 2D-to-3D transfer. It introduces HPNet, a hybrid-representation network for primitive segmentation that fuses semantic and spectral cues with adaptive weighting and mean-shift clustering. It then proposes an asymmetric Implicit AutoEncoder (IAE) to address sampling variations in self-supervised learning, and a masked-3D feature prediction approach (MaskFeat3D) that emphasizes recovering high-order point features rather than point positions. Finally, it demonstrates a transfer-learning framework (MVNet) that leverages pre-trained 2D models via multi-view projection and cross-view consistency to boost 3D understanding. Across extensive experiments on benchmarks like ModelNet40, ScanObjectNN, ShapeNetPart, ScanNet, and SUN RGB-D, the methods show robust gains in classification, detection, and segmentation, highlighting practical benefits for 3D scene understanding and autonomous systems.

Abstract

With the rapid advancement of technology, 3D data acquisition and utilization have become increasingly prevalent across various fields, including computer vision, robotics, and geospatial analysis. 3D data, captured through methods such as 3D scanners, LiDARs, and RGB-D cameras, provides rich geometric, shape, and scale information. When combined with 2D images, 3D data offers machines a comprehensive understanding of their environment, benefiting applications like autonomous driving, robotics, remote sensing, and medical treatment. This dissertation focuses on three main areas: supervised representation learning for point cloud primitive segmentation, self-supervised learning methods, and transfer learning from 2D to 3D. Our approach, which integrates pre-trained 2D models to support 3D network training, significantly improves 3D understanding without merely transforming 2D data. Extensive experiments validate the effectiveness of our methods, showcasing their potential to advance point cloud representation learning by effectively integrating 2D knowledge.

Paper Structure

This paper contains 147 sections, 2 theorems, 36 equations, 24 figures, 29 tables.

Key Result

Proposition 4.1.1

Let $Q\in \mathbb{R}^{n\times m}$ collect the top-$m$ eigenvectors of the covariance matrix $C = \sum_{k=1}^{N} x_k x_k'$. Then, under the assumption that $\epsilon_k \in L^{\perp}$ for $1 \leq k \leq N$, $Q^{\star} = Q$.
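
The matrix $Q$ in the proposition can be computed directly from data. The following is a minimal NumPy sketch, not the author's implementation: the function name `top_m_eigenvectors`, the synthetic data, and the choice $m = 2$ are assumptions made for illustration. It forms $C$ as the plain sum of outer products given in the statement (no mean-centering) and does not construct $Q^{\star}$ or test the assumption on $\epsilon_k$.

```python
import numpy as np


def top_m_eigenvectors(X, m):
    """Return (Q, lam) for C = sum_k x_k x_k^T.

    X   : (N, n) array whose rows are the samples x_k (illustrative input).
    Q   : (n, m) array whose columns are the top-m eigenvectors of C.
    lam : (m,) array of the matching eigenvalues, largest first.
    """
    C = X.T @ X  # equals the sum of outer products x_k x_k^T, shape (n, n)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]  # indices of the m largest
    return eigvecs[:, order], eigvals[order]


# Synthetic data (an assumption): most variance lies along the first two axes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Q, lam = top_m_eigenvectors(X, m=2)
```

Because $C$ is symmetric, `np.linalg.eigh` is used rather than a general eigensolver; it returns real eigenvalues and orthonormal eigenvectors, so the columns of `Q` satisfy $Q'Q = I$.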

Figures (24)

  • Figure 1: HPNet takes a point cloud as input and outputs detected primitive patches. It can handle diverse primitives at different scales. The detected primitives have smooth boundaries.
  • Figure 2: Overview of our pipeline. HPNet consists of three modules: (1) the Dense Descriptor module takes a point cloud and optional normals as input and outputs a semantic feature descriptor, a type indicator vector, and a shape parameter vector. (2) The Spectral Embedding module takes the dense descriptors as input and builds a geometric consistency matrix $A_c$ and a smoothness matrix $A_s$, from which it outputs a consistency feature $U_c$ and a smoothness feature $U_s$. (3) The Clustering module combines the three features with adaptive weights and uses mean-shift clustering to output the segmentation result.
  • Figure 3: Primitive segmentation results of different methods. From top to bottom: ground truth, SPFN [li2019supervised], ParseNet [SharmaLMKCM20], and our approach.
  • Figure 4: Mean IoU of segmentation results on different primitive types. Here, open-b and close-b represent open and closed B-spline patches. (a): comparison between HPNet and baseline methods. (b): comparison between different components of HPNet.
  • Figure 5: Example comparisons with and without the sharp edge descriptor. Here, 'Ours-ns' denotes our model without the sharp edge descriptor. Adding the sharp edge descriptor helps the model capture boundaries better.
  • ...and 19 more figures

Theorems & Definitions (3)

  • Proposition 4.1.1
  • Definition 4.1.1
  • Proposition 4.1.2