SWCF-Net: Similarity-weighted Convolution and Local-global Fusion for Efficient Large-scale Point Cloud Semantic Segmentation

Zhenchao Lin, Li He, Hongqiang Yang, Xiaoqun Sun, Cuojin Zhang, Weinan Chen, Yisheng Guan, Hong Zhang

TL;DR

SWCF-Net tackles the challenge of efficient semantic segmentation for large-scale point clouds by jointly modeling local geometry and global semantic context. It introduces a similarity-weighted local operator (SWConv), a lightweight Average Transformer for global context with linearized attention via downsampling of the $K$ and $V$ channels, and an orthogonal fusion mechanism to fuse local and global features without redundancy. The approach achieves competitive mIoU on SemanticKITTI and Toronto3D while reducing computational cost and memory usage compared to traditional Transformer-based methods. These contributions enable scalable, accurate segmentation of large outdoor scenes with practical runtime and resource requirements.

Abstract

A large-scale point cloud consists of a multitude of individual objects and therefore carries rich structural and semantic contextual information, which makes segmenting it efficiently a challenging problem. Most existing research focuses on capturing intricate local features without giving due consideration to global ones, and thus fails to leverage semantic context. In this paper, we propose a Similarity-Weighted Convolution and local-global Fusion Network, named SWCF-Net, which takes both local and global features into account. We propose a Similarity-Weighted Convolution (SWConv) to extract local features effectively, in which similarity weights are incorporated into the convolution operation to enhance its generalization capability. We then apply a downsampling operation to the K and V channels within the attention module, reducing the quadratic complexity to linear and enabling the Transformer to handle large-scale point clouds. Finally, orthogonal components are extracted from the global features and aggregated with the local features, which eliminates redundant information between local and global features and thereby improves efficiency. We evaluate SWCF-Net on the large-scale outdoor datasets SemanticKITTI and Toronto3D. The experimental results demonstrate the effectiveness of the proposed network: our method achieves competitive results at lower computational cost and handles large-scale point clouds efficiently.
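The K/V downsampling step can be illustrated with a small NumPy sketch. It assumes, following the Figure 4 caption, that the reduced tokens come from farthest point sampling followed by averaging the features of each sampled point's nearest neighbors. The function names, the neighbor count `k`, the single-head scaled dot-product form, and the random projection matrices are illustrative assumptions, not the authors' implementation. Queries are kept for all $N$ points while keys and values are computed from only $P$ pooled tokens, so the score matrix has shape $N \times P$ rather than $N \times N$.

```python
import numpy as np


def farthest_point_sampling(xyz, num_samples):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = np.zeros(num_samples, dtype=np.int64)
    min_dist = np.full(xyz.shape[0], np.inf)
    chosen[0] = 0  # assumption: fixed start index (a random start is also common)
    for i in range(1, num_samples):
        d = np.sum((xyz - xyz[chosen[i - 1]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(np.argmax(min_dist))
    return chosen


def average_pool_tokens(xyz, feats, num_samples, k):
    """Average the features of the k nearest neighbors of each FPS center."""
    centers = farthest_point_sampling(xyz, num_samples)
    # Dense squared distances from each center to every point, shape (P, N).
    d = np.sum((xyz[centers][:, None, :] - xyz[None, :, :]) ** 2, axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]  # (P, k) neighbor indices
    return feats[knn].mean(axis=1)      # (P, C) pooled features


def reduced_attention(feats, pooled, w_q, w_k, w_v):
    """Scaled dot-product attention: N queries against P pooled keys/values."""
    q = feats @ w_q    # (N, D)
    k = pooled @ w_k   # (P, D)
    v = pooled @ w_v   # (P, D)
    scores = q @ k.T / np.sqrt(q.shape[1])        # (N, P), not (N, N)
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
N, P, C, D = 2048, 64, 16, 16
xyz = rng.normal(size=(N, 3))
feats = rng.normal(size=(N, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) * 0.1 for _ in range(3))

pooled = average_pool_tokens(xyz, feats, num_samples=P, k=8)
out, weights = reduced_attention(feats, pooled, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

With $P$ fixed and $P \ll N$, the attention cost grows as $O(NP)$, that is, linearly in the number of points. The dense center-to-point distance matrix used here for neighbor search is only for brevity; a spatial index would be a more typical choice at scale.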

Paper Structure (16 sections, 7 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Demonstration of 3D LiDAR segmentation. In large-scale point clouds, semantic context is essential for segmentation. Many existing segmentation methods rely on local features only and ignore global information, which is closely related to the semantic context.
  • Figure 2: The architecture of our SWCF-Net. Our model adopts an encoder-decoder architecture, where the encoder consists of a local encoder, global encoder, and fusion module. The local encoder adopts the proposed similarity-weighted 3D CNN to capture local features. The global encoder uses a lightweight Transformer to capture global features. The fusion module employs orthogonal fusion to effectively integrate both local and global features.
  • Figure 3: The proposed local encoding module. The Similarity-Weighted Convolution (SWConv) component is designed to apply weighted filtering to point cloud features at local regions.
  • Figure 4: The proposed global encoding module. The Average Transformer uses a downsampling module to reduce the number of points in the $K$ and $V$ channels of the attention mechanism from $N$ to $P$, where $P \ll N$, which avoids the quadratic complexity of conventional attention. Black points represent the raw point cloud, red points indicate the points sampled by farthest point sampling (FPS), green points denote the neighbors of the sampled points, and purple points represent the feature points obtained by averaging within each local area.
  • Figure 5: The proposed fusion module. Orthogonal components are extracted in global features and then aggregated with local features, thereby eliminating redundant information between local and global features.
  • ...and 2 more figures
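The orthogonal fusion described in the Figure 5 caption can be sketched as follows. The captions do not give the exact operator, so this example assumes a common formulation: for each point, the global feature is projected onto the local feature, the projection is subtracted to leave the orthogonal component, and that component is concatenated with the local feature. The function name `orthogonal_fusion`, the use of concatenation as the aggregation step, and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np


def orthogonal_fusion(local_feats, global_feats, eps=1e-8):
    """Keep only the part of each global feature that is orthogonal to the
    matching local feature, then concatenate it with the local feature.

    local_feats, global_feats: arrays of shape (N, C).
    Returns (fused, orthogonal) with shapes (N, 2C) and (N, C).
    """
    # Per-point inner products, shape (N, 1).
    dot = np.sum(global_feats * local_feats, axis=1, keepdims=True)
    norm_sq = np.sum(local_feats * local_feats, axis=1, keepdims=True)
    # Projection of the global feature onto the local feature direction.
    projection = dot / (norm_sq + eps) * local_feats
    # The residual carries the information the local feature does not have.
    orthogonal = global_feats - projection
    fused = np.concatenate([local_feats, orthogonal], axis=1)
    return fused, orthogonal


rng = np.random.default_rng(0)
local_feats = rng.normal(size=(5, 4))
global_feats = rng.normal(size=(5, 4))
fused, orthogonal = orthogonal_fusion(local_feats, global_feats)

print(fused.shape)  # (5, 8)
# Largest per-point inner product between residual and local feature;
# it should be close to zero up to floating-point error.
print(np.abs(np.sum(orthogonal * local_feats, axis=1)).max())
```

Under this assumed formulation, the component of the global feature that is parallel to the local feature is exactly what gets dropped, which is one way to read the caption's claim about removing redundant information.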