GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation
Abiao Li, Chenlei Lv, Guofeng Mei, Yifan Zuo, Jian Zhang, Yuming Fang
TL;DR
This paper addresses the challenge of learning both strong local geometry and long-range semantic context in point cloud segmentation. It introduces GSTran, a two-transformer architecture. The Local Geometric Transformer uses the tangent-plane discrepancy $d_{tan}$, with $weight_{geo}=\exp(-d_{tan})$, to emphasize geometrically similar neighbors. The Global Semantic Transformer employs a multi-head voting scheme to refine long-range similarities via a global mask. The approach achieves state-of-the-art or competitive results on ShapeNetPart (part segmentation) and S3DIS (semantic segmentation), with ablations confirming the complementary roles of the local geometry and global voting components. The work provides a practical, geometry-aware, context-rich framework for accurate point cloud segmentation, and the code is released for reproducibility.
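To make the weighting $weight_{geo}=\exp(-d_{tan})$ concrete, here is a minimal NumPy sketch. The summary above does not define $d_{tan}$, so the sketch assumes it is the distance from each neighbor to the tangent plane at the query point, computed from a query normal that is taken as given. The function name `geometric_weights` and its arguments are hypothetical and are not taken from the released code.

```python
import numpy as np

def geometric_weights(query, normal, neighbors):
    """Return exp(-d_tan) for each neighbor of a query point.

    ASSUMPTION: d_tan is the point-to-plane distance between a neighbor
    and the tangent plane at the query point. The paper's exact
    definition may differ.
    """
    normal = normal / np.linalg.norm(normal)   # unit normal at the query point
    offsets = neighbors - query                # (k, 3) offsets from the query
    d_tan = np.abs(offsets @ normal)           # (k,) distance to the tangent plane
    return np.exp(-d_tan)                      # weights in (0, 1]

query = np.array([0.0, 0.0, 0.0])
normal = np.array([0.0, 0.0, 1.0])             # tangent plane is z = 0
neighbors = np.array([
    [1.0, 0.0, 0.0],   # lies in the tangent plane
    [0.0, 1.0, 0.5],   # slightly off the plane
    [0.0, 0.0, 2.0],   # far off the plane
])
weights = geometric_weights(query, normal, neighbors)
```

Under this assumption, a neighbor lying in the tangent plane receives weight 1, and the weight decays exponentially as the neighbor moves off the plane, which matches the stated goal of emphasizing geometrically similar neighbors.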
Abstract
Learning meaningful local and global information remains a challenge in point cloud segmentation tasks. When utilizing local information, prior studies indiscriminately aggregate neighbor information from different classes to update query points, potentially compromising the distinctive features of the query points. In parallel, inaccurate modeling of long-distance contextual dependencies when utilizing global information can also degrade model performance. To address these issues, we propose GSTran, a novel transformer network tailored for the segmentation task. The proposed network consists of two principal components: a local geometric transformer and a global semantic transformer. In the local geometric transformer module, we explicitly calculate the geometric disparity within the local region. This amplifies the affinity with geometrically similar neighbor points while suppressing the association with other neighbors. In the global semantic transformer module, we design a multi-head voting strategy. This strategy evaluates semantic similarity across the entire spatial range, facilitating the precise capture of contextual dependencies. Experiments on the ShapeNetPart and S3DIS benchmarks demonstrate the effectiveness of the proposed method, showing its superiority over other algorithms. The code is available at https://github.com/LAB123-tech/GSTran.
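The abstract does not spell out the voting rule, so the following is only an illustrative sketch of one way a multi-head vote could produce a global mask. It assumes that each head casts a binary vote on whether two points are semantically similar (cosine similarity above a threshold) and that a pair is kept when a strict majority of heads agree. The function `voting_mask`, the `threshold` parameter, and the majority rule are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def voting_mask(features, num_heads, threshold=0.5):
    """Build a boolean (n, n) mask from per-head similarity votes.

    ASSUMPTIONS: cosine similarity per head, a fixed threshold per vote,
    and a strict-majority rule. None of these are given in the abstract.
    """
    n, c = features.shape
    assert c % num_heads == 0, "channels must split evenly across heads"
    # Split channels into heads: (num_heads, n, c // num_heads).
    heads = features.reshape(n, num_heads, c // num_heads).transpose(1, 0, 2)
    # Normalize each per-head feature so dot products are cosine similarities.
    heads = heads / (np.linalg.norm(heads, axis=-1, keepdims=True) + 1e-8)
    sim = heads @ heads.transpose(0, 2, 1)      # (num_heads, n, n)
    votes = (sim > threshold).sum(axis=0)       # number of heads voting "similar"
    return votes > num_heads / 2                # keep pairs with a strict majority

features = np.array([
    [1.0, 0.0, 1.0, 0.0],   # point 0
    [1.0, 0.0, 1.0, 0.0],   # point 1: agrees with point 0 in both heads
    [0.0, 1.0, 1.0, 0.0],   # point 2: agrees with point 0 in one head only
])
mask = voting_mask(features, num_heads=2)
```

In this toy input, points 0 and 1 are linked because both heads vote for them, while point 2 is linked only to itself because a single vote out of two is not a majority. A mask of this kind could then restrict which long-range pairs contribute to global attention.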
