GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation

Wei Zhou, Qian Wang, Weiwei Jin, Xinzhe Shi, Ying He

TL;DR

GTNet addresses limitations of static graphs and global-only Transformers in 3D point-cloud learning by fusing dynamic graph construction with Local Transformer (intra-domain cross-attention over a learned neighborhood) and Global Transformer (global self-attention). It introduces Graph Transformer blocks that update graphs across layers, uses edge-encoded local geometry, and employs residual connections to stabilize training, achieving strong results on ModelNet40, ShapeNet Part, and S3DIS. Ablation analyses validate the necessity of local and global attention, feature encoding, and dynamic graph updates. The approach demonstrates robust, scalable performance for classification and segmentation, with competitive complexity and clear avenues for multi-scale extensions.

Abstract

Recently, graph-based and Transformer-based deep learning networks have demonstrated excellent performance on various point cloud tasks. Most existing graph methods are based on static graphs, which take a fixed input to establish graph relations. Moreover, many graph methods apply maximization or averaging to aggregate neighboring features, so that either only a single neighboring point affects the centroid's feature or all neighboring points have the same influence on it, which ignores the correlations and differences between points. Most Transformer-based methods extract point cloud features based on global attention and lack feature learning on local neighbors. To solve the problems of these two types of models, we propose a new feature extraction block named Graph Transformer and construct a 3D point cloud learning network called GTNet to learn features of point clouds in local and global patterns. Graph Transformer integrates the advantages of graph-based and Transformer-based methods, and consists of Local Transformer and Global Transformer modules. Local Transformer uses a dynamic graph to calculate the weights of all neighboring points by intra-domain cross-attention with dynamically updated graph relations, so that every neighboring point can affect the centroid's features with a different weight; Global Transformer enlarges the receptive field of Local Transformer by global self-attention. In addition, to avoid vanishing gradients caused by the increasing depth of the network, we apply residual connections to the centroid features in GTNet; we also use the features of the centroid and its neighbors to generate local geometric descriptors in Local Transformer, which strengthens the local information learning capability of the model. Finally, we use GTNet for shape classification, part segmentation, and semantic segmentation tasks in this paper.
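
The two attention types named in the abstract can be illustrated with a minimal NumPy sketch. It assumes plain scaled dot-product attention, and the function names and the projection matrices `wq`, `wk`, `wv` are hypothetical. The abstract does not give GTNet's exact formulation, which may differ, and the sketch leaves out the edge-based feature encoding and the residual connection.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_cross_attention(f_center, f_neighbor, wq, wk, wv):
    """Weight each centroid's K neighbors by attention (assumed form).

    f_center:   (N, C) centroid features, used as queries.
    f_neighbor: (N, K, C) neighbor features, used as keys and values.
    Returns the aggregated features (N, D) and the weights (N, K).
    """
    q = f_center @ wq                      # (N, D)
    k = f_neighbor @ wk                    # (N, K, D)
    v = f_neighbor @ wv                    # (N, K, D)
    scores = np.einsum("nd,nkd->nk", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # one weight per neighbor
    out = np.einsum("nk,nkd->nd", weights, v)
    return out, weights

def global_self_attention(f, wq, wk, wv):
    """Let every centroid attend to all N centroids (assumed form)."""
    q, k, v = f @ wq, f @ wk, f @ wv       # each (N, D)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, N)
    return weights @ v
```

Unlike max pooling, where a single neighbor determines each output channel, or average pooling, where every neighbor counts equally, the softmax weights let each neighbor contribute with its own weight. If the query projection is all zeros, the weights become uniform and the local attention reduces to average pooling.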

Paper Structure

This paper contains 15 sections, 13 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Updating process of the dynamic graph in coordinate space and feature spaces. The figure shows the dynamic graph construction for three centroids; the number of neighboring points $K$ of each centroid is set to 4 when performing K-NN, $p_{i}$ ($i=1,2,3$) are the coordinates of the points, and $f'_{i}$ and $f''_{i}$ are the features of the points, where $f''_{i}$ is the deeper feature of $f'_{i}$.
  • Figure 2: Detailed architecture diagram of Graph Transformer. The feature extraction network consists of four feature extraction blocks named Graph Transformer, each composed of a Local Transformer and a Global Transformer. The Local Transformer uses the intra-domain cross-attention mechanism based on the dynamic graph structure to obtain local features of the point cloud, and the Global Transformer uses the global self-attention mechanism to obtain global features of the point cloud. In the shape classification task, the output dimensions of the feature extraction layers are 64, 64, 128, and 256, respectively. In the part segmentation and semantic segmentation tasks, the output dimension of each feature extraction layer is 96, and after the last feature extraction layer the outputs of the four layers are concatenated to obtain 384-dimensional features. The Coordinate Adjustment Network is used to enhance invariance to rotation and translation.
  • Figure 3: The process of graph generation and feature encoding. We regard all points as centroids, perform K-NN for every centroid to find its neighborhood, with $K$ set to 4, and finally obtain $\boldsymbol{F}_{neighbor}$, composed of neighboring point features, and $\boldsymbol{E}$, composed of edge features.
  • Figure 4: Two types of attention in Local Transformer and Global Transformer. (a) Centroids adopt Local Transformer to generate local fine-grained features within their neighborhoods; the connections between each centroid and its neighboring points are regarded as edges. (b) The input of the Global Transformer is the local features of the centroids after aggregation of the neighborhood features, and the global feature of a centroid relies on all centroids.
  • Figure 5: Structure of Local Transformer. Local Transformer first uses the dynamic graph to obtain the neighboring points by K-NN, and then conducts a weighted summation of the features of the neighboring points connected by edge relations. $\boldsymbol{F}'$ is the feature encoding generated from the edge relations $\boldsymbol{E}$, which enhances the perception of local shapes; $K$ is the number of neighboring points, $C$ is the dimension of the input features, and $D$ is the dimension of the generated features.
  • ...and 3 more figures
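
The graph generation described in the captions of Figures 1 and 3 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a point is excluded from its own neighborhood, and each edge feature is taken as the neighbor feature minus the centroid feature (a common choice, for example in DGCNN), since the captions do not give the formula for $\boldsymbol{E}$.

```python
import numpy as np

def knn_indices(x, k):
    """Return (N, k) indices of each row's k nearest rows in the space of x."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]          # (N, N, C)
    d2 = np.sum(diff * diff, axis=-1)             # squared distances (N, N)
    np.fill_diagonal(d2, np.inf)                  # assumption: exclude self
    return np.argsort(d2, axis=1)[:, :k]

def build_graph(features, k):
    """Treat every point as a centroid and gather its K-NN neighborhood.

    Returns the neighbor indices (N, k), the neighbor features
    F_neighbor (N, k, C), and the edge features E (N, k, C).
    """
    features = np.asarray(features, dtype=float)
    idx = knn_indices(features, k)
    f_neighbor = features[idx]                    # gather neighbor features
    edges = f_neighbor - features[:, None, :]     # assumed edge definition
    return idx, f_neighbor, edges

# Dynamic graph: rebuild the neighborhoods from the current features.
rng = np.random.default_rng(0)
points = rng.normal(size=(16, 3))                 # coordinates p_i
idx_xyz, f_nb_xyz, e_xyz = build_graph(points, k=4)

# Stand-in for learned features f'_i (a random projection, not a trained layer).
feats = np.tanh(points @ rng.normal(size=(3, 8)))
idx_feat, f_nb_feat, e_feat = build_graph(feats, k=4)
```

Because the second call runs K-NN on the features rather than on the coordinates, a centroid's neighbors can change from one layer to the next. A static graph would instead reuse `idx_xyz` at every layer.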