Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

Yong He, Hongshan Yu, Chaoxu Mu, Mingtao Feng, Tongjia Chen, Zechuan Li, Anwaar Ulhaq, Ajmal Mian

TL;DR

This work addresses the challenge of leveraging task-level context in 3D point-cloud processing by introducing SMTransformer, which injects task priors through a soft mask into vector attention, enabling boundary-aware feature learning. It further couples encoder and decoder layers with a Skip-Attention Up-sampling Block to dynamically fuse cross-resolution features, and reduces parameter overhead via a Shared Point Position Encoding strategy. The approach achieves state-of-the-art or competitive semantic segmentation performance on indoor and outdoor benchmarks (e.g., S3DIS Area 5 and SWAN) while maintaining a compact model, and demonstrates strong robustness to density variations, perturbations, and noise. Collectively, these components advance practical 3D perception for robotics and automation by improving accuracy and efficiency in point-cloud tasks.

Abstract

Point cloud processing methods leverage local and global point features to cater to downstream tasks, yet they often overlook the task-level context inherent in point clouds during the encoding stage. We argue that integrating task-level information into the encoding stage significantly enhances performance. To that end, we propose SMTransformer, which incorporates task-level information into a vector-based transformer by utilizing a soft mask generated from task-level queries and keys to learn the attention weights. Additionally, to facilitate effective communication between features from the encoding and decoding layers in high-level tasks such as segmentation, we introduce a skip-attention-based up-sampling block. This block dynamically fuses features from various resolution points across the encoding and decoding layers. To mitigate the increase in network parameters and training time resulting from the complexity of the aforementioned blocks, we propose a novel shared position encoding strategy. This strategy allows various transformer blocks to share the same position information over the same resolution points, thereby reducing network parameters and training time without compromising accuracy. Experimental comparisons with existing methods on multiple datasets demonstrate the efficacy of SMTransformer and skip-attention-based up-sampling for point cloud processing tasks, including semantic segmentation and classification. In particular, we achieve state-of-the-art semantic segmentation results of 73.4% mIoU on S3DIS Area 5 and 62.4% mIoU on the SWAN dataset.
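
The idea of modulating vector attention with a soft mask can be illustrated with a small NumPy sketch. This is not the paper's formulation: the abstract does not give the equation, so the mask form (a sigmoid of the dot product between task-level queries and keys), the renormalisation step, and all names (soft_masked_vector_attention, tq, tk, pos) are assumptions made purely for illustration. The vector-attention part follows the general Point Transformer pattern of per-channel weights computed from query-key differences plus a position encoding.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_masked_vector_attention(q, k, v, pos, tq, tk):
    """Hypothetical soft-masked vector attention over K neighbours per point.

    q:   (N, C)    per-point queries
    k,v: (N, K, C) keys / values of the K neighbours of each point
    pos: (N, K, C) position encoding of each neighbour relative to the point
    tq:  (N, D)    task-level queries (assumed form)
    tk:  (N, K, D) task-level keys (assumed form)
    """
    # Assumed soft mask: one value in (0, 1) per (point, neighbour) pair.
    mask = 1.0 / (1.0 + np.exp(-np.einsum("nd,nkd->nk", tq, tk)))
    # Vector attention: a separate weight per channel, normalised over neighbours.
    logits = q[:, None, :] - k + pos
    weights = softmax(logits, axis=1) * mask[..., None]
    # Renormalise so the masked weights still sum to one over the neighbours.
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-9)
    return (weights * (v + pos)).sum(axis=1)

rng = np.random.default_rng(0)
N, K, C, D = 4, 3, 8, 5
out = soft_masked_vector_attention(
    rng.standard_normal((N, C)),
    rng.standard_normal((N, K, C)),
    rng.standard_normal((N, K, C)),
    rng.standard_normal((N, K, C)),
    rng.standard_normal((N, D)),
    rng.standard_normal((N, K, D)),
)
print(out.shape)
```

Under this assumed form, neighbours whose task-level keys disagree with the point's task-level query receive a mask value near zero and contribute little to the aggregated feature, which is one plausible way a mask could sharpen features near object boundaries.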

Paper Structure (18 sections, 13 equations, 7 figures, 9 tables)

This paper contains 18 sections, 13 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: (a) Network architecture for semantic segmentation. (b) Soft masked transformer block and (c) skip attention-based up-sampling block.
  • Figure 2: Comparison of the attention and position encoding in Transformers. (a) The vector attention with position encoding bias in Point Transformer, see Eq. \ref{eq:pointtransformer}. (b) The vector attention with position encoding multiplier in Point Transformer V2, see Eq. \ref{eq:pointtransformerv2}. (c) The vector attention with soft mask and enhanced position encoding bias in our proposed SMTransformer, see Eq. \ref{eq:smpointtransformer}. The different parts are coloured in blue.
  • Figure 3: Comparison of (a) Transition up and (b) Skip attention-based up-sampling. 'interpo.' stands for the interpolation operation. 'grid up' stands for the grid-based unpooling. 'SA' denotes the skip attention. The different parts are coloured in blue.
  • Figure 4: (a) Unshared point position encoding: Various transformer blocks (coloured in grey) within the same encoding or decoding layer (coloured in yellow), operating over the same resolution point cloud, require different position information (coloured in blue). (b) Shared point position encoding: Various transformer blocks within the encoding and decoding layers share position information over the same resolution point cloud.
  • Figure 5: Visualization of semantic segmentation results on S3DIS Area-5. The red boxes highlight the object boundaries in the scenes where our proposed SMTransformer performs particularly better than the Point Transformer V2 (PTv2).
  • ...and 2 more figures
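
The contrast drawn in Figure 4 between unshared and shared point position encoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer bias-free MLP, the class PositionEncoder, and the helper run_stage are hypothetical names and choices. It only shows why computing one position encoding per resolution and reusing it across blocks reduces the parameter count.

```python
import numpy as np

class PositionEncoder:
    """Hypothetical two-layer MLP mapping relative coordinates (N, K, 3) to (N, K, C)."""

    def __init__(self, channels, rng):
        self.w1 = rng.standard_normal((3, channels)) * 0.1
        self.w2 = rng.standard_normal((channels, channels)) * 0.1

    def __call__(self, rel_xyz):
        hidden = np.maximum(rel_xyz @ self.w1, 0.0)  # ReLU
        return hidden @ self.w2

    def num_params(self):
        return self.w1.size + self.w2.size

def run_stage(rel_xyz, num_blocks, channels, shared, rng):
    """Return the position encoding seen by each block and the encoder parameter count."""
    if shared:
        # One encoder per resolution; its output is computed once and reused.
        encoder = PositionEncoder(channels, rng)
        pos = encoder(rel_xyz)
        return [pos] * num_blocks, encoder.num_params()
    # Unshared: every block owns and evaluates its own encoder.
    encoders = [PositionEncoder(channels, rng) for _ in range(num_blocks)]
    return [enc(rel_xyz) for enc in encoders], sum(e.num_params() for e in encoders)

rng = np.random.default_rng(0)
rel_xyz = rng.standard_normal((5, 4, 3))  # 5 points, 4 neighbours each
_, shared_params = run_stage(rel_xyz, num_blocks=3, channels=8, shared=True, rng=rng)
_, unshared_params = run_stage(rel_xyz, num_blocks=3, channels=8, shared=False, rng=rng)
print(shared_params, unshared_params)  # 88 264
```

In this toy setting the shared variant needs one third of the position-encoder parameters for three blocks, and the encoding is evaluated once rather than once per block.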