PCT: Point cloud transformer

Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, Shi-Min Hu

TL;DR

The paper tackles learning from unordered, irregular point clouds by introducing Point Cloud Transformer (PCT), a Transformer-based framework tailored to 3D data. It centers on a coordinate-based input embedding, an offset-attention mechanism inspired by Laplacian operators, and a neighbor embedding module that fuses local context, producing a four-layer attention encoder and a pooled global feature. The paper reports state-of-the-art performance on ModelNet40, ShapeNet, and S3DIS for classification, segmentation, and normal estimation, and also analyzes computational efficiency and variants with deeper local context. The work highlights the potential of Transformer architectures for 3D point clouds and suggests directions toward larger-scale training and generation-oriented tasks.

Abstract

The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which has achieved huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that PCT achieves state-of-the-art performance on shape classification, part segmentation and normal estimation tasks.
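
The abstract names farthest point sampling and nearest neighbor search as the tools used to capture local context. The following is a minimal NumPy sketch of those two generic operations, not the authors' implementation: the function names `farthest_point_sampling` and `knn_group`, the choice of starting index, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedily pick n_samples indices, each farthest from those already chosen.

    points: (N, 3) array of coordinates. `start` is an assumed seed index.
    """
    chosen = np.empty(n_samples, dtype=np.int64)
    chosen[0] = start
    # min_dist[i] = distance from point i to its nearest chosen point so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for s in range(1, n_samples):
        chosen[s] = np.argmax(min_dist)
        new_dist = np.linalg.norm(points - points[chosen[s]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return chosen

def knn_group(points, centers_idx, k):
    """For each sampled center, return the indices of its k nearest points.

    The center itself is included, since its distance to itself is zero.
    """
    centers = points[centers_idx]                      # (S, 3)
    diff = centers[:, None, :] - points[None, :, :]    # (S, N, 3)
    dist = np.linalg.norm(diff, axis=2)                # (S, N)
    return np.argsort(dist, axis=1)[:, :k]             # (S, k)

# Toy usage on a random cloud (sizes are illustrative only).
rng = np.random.default_rng(0)
cloud = rng.random((64, 3))
centers = farthest_point_sampling(cloud, n_samples=8)
groups = knn_group(cloud, centers, k=4)
print(centers.shape, groups.shape)  # (8,) (8, 4)
```

Sampling spreads the chosen centers across the shape, while grouping attaches a small neighborhood to each center; a brute-force distance matrix like the one above is only practical for small clouds.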

Paper Structure

This paper contains 17 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Attention map and part segmentation generated by PCT. First three columns: point-wise attention map for different query points (indicated by ☆), yellow to blue indicating increasing attention weight. Last column: part segmentation results.
  • Figure 2: PCT architecture. The encoder mainly comprises an Input Embedding module and four stacked Attention modules. The decoder mainly comprises multiple Linear layers. Numbers above each module indicate its output channels. MA-Pool concatenates Max-Pool and Average-Pool. LBR combines Linear, BatchNorm and ReLU layers. LBRD means LBR followed by a Dropout layer.
  • Figure 3: Architecture of Offset-Attention. Numbers above tensors are numbers of dimensions $N$ and feature channels $D/D_a$, with switches showing alternatives of Self-Attention or Offset-Attention: dotted lines indicate Self-Attention branches.
  • Figure 4: Left: Neighbor Embedding architecture; Middle: SG Module with $N_{in}$ input points, $D_{in}$ input channels, $k$ neighbors, $N_{out}$ output sampled points and $D_{out}$ output channels; Top-right: example of sampling (colored balls represent sampled points); Bottom-right: example of grouping with $k$-NN neighbors; Number above LBR: number of output channels. Number above SG: number of sampled points and its output channels.
  • Figure 5: Segmentations from PointNet, NPCT, SPCT, PCT and Ground Truth (GT).
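
The Figure 2 caption describes MA-Pool as the concatenation of Max-Pool and Average-Pool. A small NumPy sketch of that idea follows; the function name `ma_pool`, the `(N, D)` input layout, and the max-then-mean concatenation order are assumptions, not details taken from the paper.

```python
import numpy as np

def ma_pool(features):
    """Pool per-point features of shape (N, D) into one global vector of shape (2*D,).

    Assumed layout: max-pooled channels first, then average-pooled channels.
    """
    return np.concatenate([features.max(axis=0), features.mean(axis=0)])

feats = np.array([[1.0, 4.0],
                  [3.0, 0.0]])
global_feat = ma_pool(feats)
print(global_feat)  # [3. 4. 2. 2.]

# Both pooling operations ignore point order, so shuffling the rows
# leaves the global feature unchanged.
shuffled = ma_pool(feats[::-1])
```

Because max and mean are both symmetric functions of the point set, the pooled feature stays the same under any reordering of the input points, which matches the permutation invariance the abstract emphasizes.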