3D Learnable Supertoken Transformer for LiDAR Point Cloud Scene Segmentation

Dening Lu, Jun Zhou, Kyle Gao, Linlin Xu, Jonathan Li

TL;DR

3D Transformers for LiDAR point cloud segmentation are often inefficient because of how they tokenize points and because they rely on precomputed superpoints. The paper introduces the 3D Learnable Supertoken Transformer (3DLST), a framework that uses learnable supertokens optimized by Dynamic Supertoken Optimization (DSO), together with Deep Feature Enhancement (DFE) and Cross-Attention-guided Upsampling (CAU), within a novel W-net architecture. The approach avoids the time-consuming superpoint preprocessing, reports state-of-the-art results on MS-LiDAR, DALES, and Toronto-3D, and runs up to 5x faster than previous best-performing methods. The results indicate strong adaptability across airborne MS-LiDAR, aerial LiDAR, and vehicle-mounted LiDAR data, making 3DLST a practical option for large-scale 3D scene understanding.

Abstract

3D Transformers have achieved great success in point cloud understanding and representation. However, there is still considerable scope for further development in effective and efficient Transformers for large-scale LiDAR point cloud scene segmentation. This paper proposes a novel 3D Transformer framework, named 3D Learnable Supertoken Transformer (3DLST). The key contributions are summarized as follows. Firstly, we introduce the first Dynamic Supertoken Optimization (DSO) block for efficient token clustering and aggregation, where the learnable supertoken definition avoids the time-consuming pre-processing of traditional superpoint generation. Since the learnable supertokens can be dynamically optimized by multi-level deep features during network learning, they are tailored to semantic homogeneity-aware token clustering. Secondly, an efficient Cross-Attention-guided Upsampling (CAU) block is proposed for token reconstruction from optimized supertokens. Thirdly, 3DLST is equipped with a novel W-net architecture instead of the common U-net design, which is more suitable for Transformer-based feature learning. The SOTA performance on three challenging LiDAR datasets (airborne MultiSpectral LiDAR (MS-LiDAR): 89.3% average F1 score; DALES: 80.2% mIoU; Toronto-3D: 80.4% mIoU) demonstrates the superiority of 3DLST and its strong adaptability to various LiDAR point cloud data (airborne MS-LiDAR, aerial LiDAR, and vehicle-mounted LiDAR data). Furthermore, 3DLST is also efficient, running up to 5x faster than previous best-performing methods.
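
To make the idea of clustering tokens around supertokens more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' DSO block: the abstract does not give the formulation, so the scaled dot-product similarity, the row-wise softmax, the mean aggregation, and the function name `assign_tokens_to_supertokens` are all assumptions. The only detail taken from the source is that the Figure 2 caption mentions a cross-attention result before and after argmax. In a real model the supertokens would be learnable parameters updated by backpropagation; here they are a fixed array.

```python
import numpy as np


def assign_tokens_to_supertokens(tokens, supertokens):
    """Hard-assign each token to one supertoken via a cross-attention map.

    tokens:      (N, C) token features
    supertokens: (M, C) supertoken features (fixed here; learnable in a real model)
    Returns (assignment, aggregated):
      assignment is an (N,) integer array with values in [0, M),
      aggregated is an (M, C) array holding the mean of each cluster's tokens.
    """
    tokens = np.asarray(tokens, dtype=float)
    supertokens = np.asarray(supertokens, dtype=float)
    c = tokens.shape[1]
    m = supertokens.shape[0]

    # Assumed similarity: scaled dot product between tokens and supertokens.
    logits = tokens @ supertokens.T / np.sqrt(c)           # (N, M)

    # Row-wise softmax gives a soft attention map over supertokens.
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)

    # Argmax turns the soft map into a hard cluster assignment.
    assignment = attn.argmax(axis=1)                       # (N,)

    # Aggregate by averaging the tokens of each cluster.
    one_hot = np.eye(m)[assignment]                        # (N, M)
    counts = one_hot.sum(axis=0)                           # (M,)
    sums = one_hot.T @ tokens                              # (M, C)
    means = sums / np.maximum(counts, 1.0)[:, None]

    # Empty clusters keep their previous supertoken features.
    aggregated = np.where(counts[:, None] > 0, means, supertokens)
    return assignment, aggregated


if __name__ == "__main__":
    toy_tokens = [[10.0, 0.0], [9.0, 0.0], [0.0, 10.0], [0.0, 8.0]]
    toy_supertokens = [[1.0, 0.0], [0.0, 1.0]]
    labels, pooled = assign_tokens_to_supertokens(toy_tokens, toy_supertokens)
    print(labels)   # cluster index per token
    print(pooled)   # mean feature per supertoken
```

The hard argmax is not differentiable, so a trainable version would need some way to pass gradients to the supertokens (for example through the soft attention weights); how 3DLST handles this is not stated in the abstract.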

Paper Structure (17 sections, 7 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Architecture of 3DLST for point cloud segmentation. It has two main modules, one for token clustering and one for deep feature extraction, which together constitute the novel W-net structure. DSO denotes Dynamic Supertoken Optimization, DFE denotes Deep Feature Enhancement, CAU denotes Cross-Attention-guided Upsampling, and STS denotes SuperToken Sparsification. The supertoken clustering and token reconstruction results shown for each module illustrate the dynamic optimization process of the network. A brief illustration of the DSO block is also provided.
  • Figure 2: Illustration of the cross-attention calculation, with visualizations of the CAM before and after argmax.
  • Figure 3: Illustration of the DFE block.
  • Figure 4: Illustration of the CAU block.
  • Figure 5: Token reconstruction results on each module during network training, where the training and validation loss curves are also shown. The reconstruction results gradually approach the ground truth as the network is trained, illustrating the strong feature modeling capabilities of 3DLST.
  • ...and 5 more figures