GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Abiao Li, Chenlei Lv, Guofeng Mei, Yifan Zuo, Jian Zhang, Yuming Fang

TL;DR

This paper addresses the challenge of learning both strong local geometry and long-range semantic context in point cloud segmentation. It introduces GSTran, a two-transformer architecture consisting of a Local Geometric Transformer that uses the tangent-plane discrepancy $d_{tan}$ with $weight_{geo}=\exp(-d_{tan})$ to emphasize geometrically similar neighbors, and a Global Semantic Transformer that employs a multi-head voting scheme to refine long-range similarities via a global mask. The approach achieves state-of-the-art or competitive results on ShapeNetPart and S3DIS for both semantic and part segmentation, with ablations confirming the complementary roles of the local geometry and global voting components. The work provides a practical, geometry-aware, context-rich framework for accurate point cloud segmentation and releases its code for reproducibility and broader adoption.

Abstract

Learning meaningful local and global information remains a challenge in point cloud segmentation tasks. When utilizing local information, prior studies indiscriminately aggregate neighbor information from different classes to update query points, potentially compromising the distinctive features of the query points. In parallel, inaccurate modeling of long-distance contextual dependencies when utilizing global information can also degrade model performance. To address these issues, we propose GSTran, a novel transformer network tailored for the segmentation task. The proposed network consists of two principal components: a local geometric transformer and a global semantic transformer. In the local geometric transformer module, we explicitly calculate the geometric disparity within the local region. This amplifies the affinity with geometrically similar neighbor points while suppressing the association with other neighbors. In the global semantic transformer module, we design a multi-head voting strategy. This strategy evaluates semantic similarity across the entire spatial range, facilitating the precise capture of contextual dependencies. Experiments on the ShapeNetPart and S3DIS benchmarks demonstrate the effectiveness of the proposed method, showing its superiority over other algorithms. The code is available at https://github.com/LAB123-tech/GSTran.
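
To make the local geometric weighting concrete, the following is a minimal NumPy sketch of $weight_{geo}=\exp(-d_{tan})$, where $d_{tan}$ is taken to be the point-to-plane distance from the query point to the tangent plane at each neighbor. The function name `tangent_plane_weights`, the assumption that per-neighbor normals are already available, and the toy coordinates are illustrative choices, not taken from the released code.

```python
import numpy as np

def tangent_plane_weights(p, neighbors, normals):
    """Return (exp(-d_tan), d_tan) for each neighbor of the query point p.

    p:         (3,) query point
    neighbors: (K, 3) neighbor coordinates q_k
    normals:   (K, 3) normals at each q_k (assumed given; need not be unit length)
    """
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    # Point-to-plane distance: |n_k . (p - q_k)| for every neighbor k.
    d_tan = np.abs(np.einsum("kd,kd->k", unit, p - neighbors))
    return np.exp(-d_tan), d_tan

# Toy setup: both neighbors are at Euclidean distance 1 from p1,
# but only the tangent plane at q2 passes through p1.
p1 = np.array([0.0, 0.0, 0.0])
q = np.array([[1.0, 0.0, 0.0],   # q1
              [0.0, 1.0, 0.0]])  # q2
n = np.array([[1.0, 0.0, 0.0],   # normal at q1
              [1.0, 0.0, 0.0]])  # normal at q2
weights, d_tan = tangent_plane_weights(p1, q, n)
```

In this toy case the two neighbors are equally far from the query in Euclidean terms, yet `q2` receives the larger weight because the query lies on its tangent plane. This is the kind of distinction a purely distance-based weight cannot make.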

Paper Structure (10 sections, 3 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overview of the proposed model. In the encoder and decoder stages, the transformer structure serves as the primary feature aggregator throughout the network. MLP: Multi-layer perceptron. N: The number of points in the point cloud.
  • Figure 2: Structure of the local geometric transformer module. We visualize the local weight on an airplane, with a red query point located on the wing. In weight$_{ours}$, the weights of neighbor points in the wing decay slowly as the distance increases. However, the weights of other neighbor points decay rapidly.
  • Figure 3: Illustration of the distances $d_{tan1}$ and $d_{tan2}$ from $p_1$ to the tangent planes of $q_1$ and $q_2$, respectively. Although the Euclidean distances from $p_1$ to $q_1$ and $q_2$ are the same, $p_1$ is closer to $t_2$, signifying that $q_2$ holds greater significance than $q_1$.
  • Figure 4: Overview of the global semantic transformer module. We visualize the refined similarity on an airplane, with a red query point located at the wing. In the refined similarity generated by our method, high response weights are exclusively assigned to points belonging to the wing section.
  • Figure 5: Visualization of segmentation results for different methods on S3DIS. The red box indicates the region where the segmentation error occurs.
  • ...and 2 more figures
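
The refined similarity described in the Figure 4 caption can be illustrated with a rough sketch of one possible multi-head voting rule. The summary above does not specify the rule, so everything here is an assumption made for illustration: the function name `voted_similarity`, the per-row mean threshold each head uses to cast a vote, and the `vote_ratio` parameter that decides how many heads must agree before a pair survives the global mask.

```python
import numpy as np

def voted_similarity(features, num_heads=4, vote_ratio=0.5):
    """Hypothetical multi-head voting over pairwise similarity.

    features: (N, C) point features, with C divisible by num_heads.
    Returns (refined, mask), both of shape (N, N).
    """
    n, c = features.shape
    # Split channels into heads: (H, N, D).
    heads = features.reshape(n, num_heads, c // num_heads).transpose(1, 0, 2)
    heads = heads / (np.linalg.norm(heads, axis=-1, keepdims=True) + 1e-8)
    # Cosine similarity between all point pairs, per head: (H, N, N).
    sim = heads @ heads.transpose(0, 2, 1)
    # Assumed voting rule: a head votes for pairs scoring above its row mean.
    votes = sim > sim.mean(axis=-1, keepdims=True)
    # Global mask: keep a pair only if enough heads voted for it.
    mask = votes.mean(axis=0) >= vote_ratio
    refined = np.where(mask, sim.mean(axis=0), 0.0)
    return refined, mask

# Two toy "parts": points 0-1 share one feature, points 2-3 share another.
a = [1.0, 0.0, 1.0, 0.0]
b = [0.0, 1.0, 0.0, 1.0]
feats = np.array([a, a, b, b])
refined, mask = voted_similarity(feats, num_heads=2)
```

With these toy features the mask keeps only same-part pairs, so each query retains non-zero similarity solely with points from its own part, mirroring the behavior the caption describes for the wing.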