PointCubeNet: 3D Part-level Reasoning with 3x3x3 Point Cloud Blocks

Da-Yeong Kim, Yeong-Jun Cho

TL;DR

PointCubeNet addresses unsupervised 3D part-level reasoning by jointly learning global and local representations from raw point clouds and aligning them with text descriptions generated by a large language model. It avoids 3D-to-2D projections and pretrained image-language encoders. Instead, it uses a local branch of 27 blocks with self- and cross-attention, and InfoNCE-based contrastive losses to connect the visual and textual modalities. The key contributions are the first unsupervised 3D part-level reasoning over 27 local blocks, a soft local loss that handles symmetry, and zero-shot part-level reasoning demonstrated on ModelNet and ShapeNet. The results show that modeling local parts improves object understanding and that cross-domain performance remains robust without manual part annotations.
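To make the InfoNCE-based alignment concrete, here is a minimal NumPy sketch of a one-directional InfoNCE loss between a batch of 3D feature embeddings and their matching text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function name `info_nce`, the temperature value of 0.07, and the convention that row i of each matrix forms the positive pair are all assumed here.

```python
import numpy as np

def info_nce(feat_3d, feat_text, temperature=0.07):
    """One-directional InfoNCE: row i of feat_3d should match row i of feat_text.

    The temperature value is an assumption chosen for illustration.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    b = feat_text / np.linalg.norm(feat_text, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned_loss = info_nce(x, x)           # every positive pair is identical
shuffled_loss = info_nce(x, x[::-1])    # positives are deliberately mismatched
```

Correctly paired embeddings give a much smaller loss than mismatched ones, which is the signal a contrastive objective of this kind uses to pull matching 3D and text embeddings together.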

Abstract

In this paper, we propose PointCubeNet, a novel multi-modal 3D understanding framework that achieves part-level reasoning without requiring any part annotations. PointCubeNet comprises global and local branches. The proposed local branch, structured into 3x3x3 local blocks, enables part-level analysis of point cloud sub-regions with the corresponding local text labels. Leveraging the proposed pseudo-labeling method and local loss function, PointCubeNet is effectively trained in an unsupervised manner. The experimental results demonstrate that understanding 3D object parts enhances the understanding of the overall 3D object. In addition, this work is the first attempt at unsupervised 3D part-level reasoning, and it achieves reliable and meaningful results.
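As a rough illustration of the 3x3x3 block structure, the sketch below assigns each point of a cloud to one of 27 blocks. It assumes equal-width cells over the axis-aligned bounding box of the cloud; the abstract does not specify how block boundaries are chosen, so the function `assign_blocks` and its binning rule are hypothetical.

```python
import numpy as np

def assign_blocks(points, grid=3):
    """Map each point of an (N, 3) cloud to a block id in [0, grid**3).

    Assumption: cells are equal-width slices of the axis-aligned bounding box.
    """
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    extent = points.max(axis=0) - lo
    extent[extent == 0] = 1.0  # avoid division by zero on flat axes
    # Per-axis cell index in {0, ..., grid-1}; points on the max face
    # are clamped into the last cell.
    cell = np.minimum((grid * (points - lo) / extent).astype(np.int64), grid - 1)
    # Flatten (ix, iy, iz) into a single block id.
    return cell[:, 0] * grid * grid + cell[:, 1] * grid + cell[:, 2]

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(1024, 3))
block_ids = assign_blocks(cloud)
counts = np.bincount(block_ids, minlength=27)  # points per block
corners = assign_blocks(np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]))
```

Grouping points by `block_ids` yields the per-block subsets that a local feature extractor could then consume.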

Paper Structure

This paper contains 14 sections, 13 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview and potential applications of the proposed PointCubeNet. This method analyzes the global and local features of the point cloud in 3×3×3 point cloud blocks. By leveraging this capability, it can understand both an entire 3D object (e.g., classification and reasoning) and its local parts in an unsupervised manner.
  • Figure 2: Classification results of CLIP. It understands the entire object well but fails to distinguish its parts.
  • Figure 3: The pipeline of PointCubeNet. It consists of a global branch and a local branch, with four main components: (1) a global 3D feature extractor, (2) a local 3D feature extractor, (3) global and local pseudo-labeling for 3D reasoning, and (4) a text embedding module. The network is trained to measure the similarity between global and local 3D feature embeddings and text embeddings. It performs 3D classification, reasoning, and part-level reasoning without training any head structures.
  • Figure 4: Three intuitive positions $\{pos\}$ for each axis.
  • Figure 5: Examples of positive and negative pairs assignments using $P(j,k)$. Each local 3D feature corresponds to three positive and six negative text-embedding pairs.
  • ...and 5 more figures
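
The counts in the Figure 5 caption follow from the Figure 4 setup: three axes with three positions each give nine axis-position text embeddings per local block, of which three are positive and six are negative. The sketch below shows one plausible reading of that assignment. The captions do not define $P(j,k)$, so the helper `pair_mask`, the integer position indices, and the rule "positive when the position matches the block's index along that axis" are assumptions made for illustration.

```python
import numpy as np

def pair_mask(block_index, grid=3):
    """Boolean (3, grid) mask: rows are axes, columns are positions on that axis.

    Assumption: an entry is positive when the position equals the block's
    index along that axis; all other entries are negatives.
    """
    mask = np.zeros((3, grid), dtype=bool)
    mask[np.arange(3), np.asarray(block_index)] = True
    return mask

# Hypothetical block at position 0 on the first axis, 1 on the second, 2 on the third.
mask = pair_mask((0, 1, 2))
num_positive = int(mask.sum())
num_negative = int((~mask).sum())
```

Under this reading, every block has exactly one positive per axis, which reproduces the three-positive, six-negative split stated in the caption.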