GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding

Changshuo Wang, Meiqing Wu, Siew-Kei Lam, Xin Ning, Shangshu Yu, Ruiping Wang, Weijun Li, Thambipillai Srikanthan

TL;DR

GPSFormer tackles the challenge of extracting rich shape information from irregular point clouds without external data by introducing a Transformer framework built around a Global Perception Module (GPM) and a Taylor-series-inspired Local Structure Fitting Convolution (LSFConv). The GPM combines Adaptive Deformable Graph Convolution to model short-range feature-space dependencies with cross-attention and multi-head attention to capture long-range context, while LSFConv decomposes local structure into low- and high-frequency components to refine local geometry. The approach yields state-of-the-art or competitive results across 3D shape classification, part segmentation, and few-shot learning on real and synthetic benchmarks, with a compact parameter budget (~2.36M) and low FLOPs (~0.7G). This work demonstrates that explicit global-local modeling without external data can achieve robust point cloud understanding and offers avenues for efficient pre-training and few-shot extensions.

Abstract

Despite the significant advancements in pre-training methods for point cloud understanding, directly capturing intricate shape information from irregular point clouds without reliance on external data remains a formidable challenge. To address this problem, we propose GPSFormer, an innovative Global Perception and Local Structure Fitting-based Transformer, which learns detailed shape information from point clouds with remarkable precision. The core of GPSFormer is the Global Perception Module (GPM) and the Local Structure Fitting Convolution (LSFConv). Specifically, GPM utilizes Adaptive Deformable Graph Convolution (ADGConv) to identify short-range dependencies among similar features in the feature space and employs Multi-Head Attention (MHA) to learn long-range dependencies across all positions within the feature space, ultimately enabling flexible learning of contextual representations. Inspired by Taylor series, we design LSFConv, which learns both low-order fundamental and high-order refinement information from explicitly encoded local geometric structures. Integrating the GPM and LSFConv as fundamental components, we construct GPSFormer, a cutting-edge Transformer that effectively captures global and local structures of point clouds. Extensive experiments validate GPSFormer's effectiveness in three point cloud tasks: shape classification, part segmentation, and few-shot learning. The code of GPSFormer is available at https://github.com/changshuowang/GPSFormer.
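To make the two kinds of dependency named in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it substitutes plain k-nearest-neighbour grouping in feature space for ADGConv (there is no deformable or adaptive component) and single-head scaled dot-product attention without learned projections for MHA. The function names, the neighbourhood size `k=4`, and the random 16x8 feature matrix are all assumptions chosen for illustration.

```python
import numpy as np

def knn_feature_space(feats, k):
    """Indices of the k nearest neighbours of each point, measured in feature space."""
    # Pairwise squared Euclidean distances between feature vectors, shape (N, N).
    sq = np.sum(feats ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbourhood
    return np.argsort(dist, axis=1)[:, :k]

def short_range_aggregate(feats, k):
    """Short-range step: max-pool feature offsets to the k most similar points.

    Assumption: a plain, non-deformable graph aggregation standing in for ADGConv.
    """
    idx = knn_feature_space(feats, k)        # (N, k)
    neighbours = feats[idx]                  # (N, k, C)
    edges = neighbours - feats[:, None, :]   # features relative to the centre point
    return feats + edges.max(axis=1)         # (N, C)

def self_attention(feats):
    """Long-range step: every point attends to every other point.

    Assumption: single head, no learned query/key/value projections.
    """
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ feats, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))             # 16 points, 8 feature channels (toy data)
local = short_range_aggregate(feats, k=4)    # dependencies among similar features
context, weights = self_attention(local)     # dependencies across all positions
```

The neighbour search runs on feature vectors rather than 3D coordinates, which is what "similar features in the feature space" suggests; the attention step then mixes information across all 16 points regardless of similarity.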

Paper Structure (28 sections, 16 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Performance comparison on the challenging ScanObjectNN dataset. We show supervised learning-based and pre-training-based methods with fewer than 22M parameters. The proposed supervised GPSFormer outperforms state-of-the-art methods, achieving an accuracy of 95.4% with a modest parameter count of 2.36M.
  • Figure 2: Taylor series and schematic diagram of High-Order Convolution (HOConv).
  • Figure 3: Overall architecture of GPSFormer. For classification (bottom), three GPSFormer blocks run consecutively, followed by max-pooling and a multi-layer perceptron. For segmentation (top), a U-Net-style architecture is adopted, with GPSFormer blocks for downsampling and feature propagation for upsampling, followed by a multi-layer perceptron.
  • Figure 4: Classification results on the ScanObjectNN dataset. “-” denotes unknown. “*” denotes pre-training methods.
  • Figure 5: Classification results on the ModelNet40 dataset. “-” denotes unknown. “*” denotes pre-training methods.
  • ...and 1 more figure
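
Figure 2 and the abstract tie LSFConv to the Taylor series, with a low-order part carrying fundamental information and high-order terms acting as refinement. The snippet below illustrates only that analogy on a scalar function; how LSFConv realizes it on point features is not described in this section. The function `taylor_exp`, the choice of exp, the evaluation point `x = 0.5`, and the cut-off orders are assumptions made for the example.

```python
import math

def taylor_exp(x, order):
    """Partial sum of the Taylor series of exp about 0, up to and including `order`."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

x = 0.5
low_order = taylor_exp(x, 1)                # 1 + x: the coarse, low-order part
refinement = taylor_exp(x, 4) - low_order   # terms of order 2 to 4: high-order correction
approx = low_order + refinement

err_low = abs(math.exp(x) - low_order)      # error of the low-order part alone
err_full = abs(math.exp(x) - approx)        # error after adding the refinement
print(err_low, err_full)
```

The second printed error is smaller than the first: the low-order terms capture the bulk of the function and the high-order terms correct the remainder, which is the split between "fundamental" and "refinement" information that the abstract describes.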