LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception

Zixiang Zhou, Dongqiangzi Ye, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, Hassan Foroosh

TL;DR

LiDARFormer tackles the challenge of jointly learning 3D detection and semantic segmentation from LiDAR by unifying cross-space and cross-task context through transformers. It introduces a cross-space transformer to fuse dense BEV and sparse voxel features and a cross-task transformer decoder that shares high-level class- and object-level representations between tasks, enabling end-to-end multi-task learning. The method achieves state-of-the-art results on nuScenes and Waymo Open, including 81.5% mIoU and 74.3% NDS on nuScenes, and 76.4% L2 mAPH on Waymo Open, while maintaining a compact model size (~77M parameters). This unified design improves both tasks through cross-task attention and global context, with potential extensions to multi-modality and temporal fusion in future work.

Abstract

There is a recent trend in the LiDAR perception field towards unifying multiple tasks in a single strong network with improved performance, as opposed to using separate networks for each task. In this paper, we introduce a new LiDAR multi-task learning paradigm based on the transformer. The proposed LiDARFormer utilizes cross-space global contextual feature information and exploits cross-task synergy to boost the performance of LiDAR perception tasks across multiple large-scale datasets and benchmarks. Our novel transformer-based framework includes a cross-space transformer module that learns attentive features between the 2D dense Bird's Eye View (BEV) and 3D sparse voxel feature maps. Additionally, we propose a transformer decoder for the segmentation task to dynamically adjust the learned features by leveraging the categorical feature representations. Furthermore, we combine the segmentation and detection features in a shared transformer decoder with cross-task attention layers to enhance and integrate the object-level and class-level features. LiDARFormer is evaluated on the large-scale nuScenes and Waymo Open datasets for both 3D detection and semantic segmentation tasks, and it outperforms all previously published methods on both tasks. Notably, LiDARFormer achieves state-of-the-art performance of 76.4% L2 mAPH and 74.3% NDS on the challenging Waymo and nuScenes detection benchmarks, respectively, for a single-model, LiDAR-only method.
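To make the idea of attention between sparse voxel features and dense BEV features concrete, here is a minimal sketch of generic single-head scaled dot-product cross-attention in plain Python. It is not the paper's cross-space transformer: the function names, the toy feature values, and the absence of learned projections, positional encodings, and multiple heads are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of values.

    Illustrative only: no learned query/key/value projections are applied.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Hypothetical toy data: 3 non-empty voxels (queries) and a flattened 2x2 BEV map.
voxel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bev_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

# Sparse voxels gather context from every dense BEV cell.
attended = cross_attention(voxel_feats, bev_feats, bev_feats)
```

Swapping the roles (BEV cells as queries, voxels as keys and values) gives the opposite direction of information flow under the same assumptions.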

Paper Structure (16 sections, 4 equations, 10 figures, 13 tables)

Figures (10)

  • Figure 1: LiDAR Perception Network Designs. LiDAR detection (a) and segmentation (b) networks typically extract feature representations on distinct feature maps. A recent multi-task network, LidarMultiNet [ye2022lidarmultinet] (c), integrates these tasks into a single network but overlooks the differences among feature maps and the higher-level connections between tasks. Our network (d) uses transformer attention to model the transformations between 3D sparse and 2D dense features more effectively. Moreover, cross-task information is further shared through class-level and object-level feature embeddings in the multi-task transformer decoder.
  • Figure 2: The architecture of LiDARFormer. Our network first transforms the point cloud into a sparse voxel map. Next, a sparse 3D CNN extracts voxel feature representations. Between the encoder and the decoder, a Cross-space Transformer (XSF) module learns long-range information in the BEV map. Additionally, a Cross-task Transformer (XTF) decoder extracts class-level and object-level feature representations, which are fed into task-specific heads to generate the detection and segmentation predictions.
  • Figure 3: Dense-to-sparse Cross-space Transformer
  • Figure 4: Sparse-to-dense Cross-space Transformer
  • Figure 6: Cross-task Transformer (XTF). The segmentation and detection decoders share a self-attention layer to transfer cross-task features. In the segmentation decoder, bidirectional cross-attention refines the voxel features based on the aggregated class feature embedding. For simplicity, the skip connections and layer norms are omitted from this figure.
  • ...and 5 more figures
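
The first step named in the Figure 2 caption, turning a point cloud into a sparse voxel map, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the voxel size, the toy points, and the choice of mean pooling per voxel are all assumptions, since the excerpt does not describe the voxel feature encoder.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group (x, y, z) points by integer voxel index.

    Returns {voxel_index: mean_point} for non-empty voxels only, which is
    what makes the map sparse. Mean pooling is an assumed placeholder for
    a real voxel feature encoder.
    """
    buckets = defaultdict(list)
    for p in points:
        idx = tuple(math.floor((p[i] - origin[i]) / voxel_size[i]) for i in range(3))
        buckets[idx].append(p)
    return {
        idx: tuple(sum(p[i] for p in pts) / len(pts) for i in range(3))
        for idx, pts in buckets.items()
    }


def bev_occupancy(voxels):
    """Collapse the height axis: the set of (ix, iy) cells holding any voxel."""
    return {(ix, iy) for (ix, iy, _iz) in voxels}


# Hypothetical toy point cloud with 0.5 m voxels.
points = [(0.1, 0.2, 0.0), (0.3, 0.4, 0.1), (1.2, 0.1, 0.0), (1.3, 0.2, 2.5)]
voxels = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
bev_cells = bev_occupancy(voxels)
```

Here two of the four points share a voxel, and the two voxels stacked at different heights fall into the same BEV cell, which illustrates why the 3D sparse and 2D dense views carry different information.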