LidarMultiNet: Towards a Unified Multi-Task Network for LiDAR Perception

Dongqiangzi Ye, Zixiang Zhou, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, Hassan Foroosh

TL;DR

LidarMultiNet is a unified, end-to-end network that jointly performs 3D semantic segmentation, 3D object detection, and panoptic segmentation from LiDAR data, using a voxel-based 3D encoder–decoder with Global Context Pooling. A second-stage, point-based refinement further improves foreground segmentation and panoptic quality while computation remains shared across tasks. The model achieves state-of-the-art results on the Waymo Open Dataset and nuScenes, outperforming prior single-task and multi-task approaches and demonstrating that LiDAR perception tasks can be unified in a single architecture. Sharing one network reduces cost and complexity, and additional tasks can be supported by attaching new task-specific heads.

Abstract

LiDAR-based 3D object detection, semantic segmentation, and panoptic segmentation are usually implemented in specialized networks with distinctive architectures that are difficult to adapt to each other. This paper presents LidarMultiNet, a LiDAR-based multi-task network that unifies these three major LiDAR perception tasks. Among its many benefits, a multi-task network can reduce the overall cost by sharing weights and computation among multiple tasks. However, such a network typically underperforms a combination of independent single-task models. The proposed LidarMultiNet aims to bridge the performance gap between the multi-task network and multiple single-task networks. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder architecture with a Global Context Pooling (GCP) module that extracts global contextual features from a LiDAR frame. Task-specific heads are added on top of the network to perform the three LiDAR perception tasks. More tasks can be implemented simply by adding new task-specific heads while introducing little additional cost. A second stage is also proposed to refine the first-stage segmentation and generate accurate panoptic segmentation results. LidarMultiNet is extensively tested on both the Waymo Open Dataset and the nuScenes dataset, demonstrating for the first time that major LiDAR perception tasks can be unified in a single strong network that is trained end-to-end and achieves state-of-the-art performance. Notably, LidarMultiNet took the official 1st place in the Waymo Open Dataset 3D semantic segmentation challenge 2022, with the highest mIoU and the best accuracy for most of the 22 classes on the test set, using only LiDAR points as input. It also sets a new state of the art for a single model on the Waymo 3D object detection benchmark and three nuScenes benchmarks.
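The abstract ranks segmentation results by mIoU (mean intersection-over-union). As a reminder of what that number measures, here is a minimal sketch in plain Python. The function name `mean_iou`, the toy labels, and the choice to skip classes that appear in neither the predictions nor the ground truth are assumptions of this illustration; the official Waymo evaluation protocol (for example, its handling of ignored labels) is not reproduced here.

```python
from collections import Counter


def mean_iou(pred, target, num_classes):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    tp = Counter()  # per-class true positives
    fp = Counter()  # per-class false positives
    fn = Counter()  # per-class false negatives
    for p, t in zip(pred, target):
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the point is not p
            fn[t] += 1  # true class t was missed
    ious = []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        if union == 0:
            continue  # class absent everywhere; skipped (assumption of this sketch)
        ious.append(tp[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 6 points, 3 hypothetical classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is about 0.5. A single frequent class cannot dominate the score, because each class contributes equally to the average.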

Paper Structure (25 sections, 4 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Our LidarMultiNet takes LiDAR point cloud (a) as input and performs simultaneous 3D semantic segmentation (b), 3D object detection (c), and panoptic segmentation (d) in a single unified network.
  • Figure 2: Main architecture of LidarMultiNet. At the core of our network is a 3D encoder-decoder based on 3D sparse convolutions and deconvolutions. Between the encoder and the decoder, a Global Context Pooling (GCP) module extracts contextual information through the conversion between sparse and dense feature maps and via a 2D multi-scale feature extractor. The 3D segmentation head is attached to the decoder, and its predicted voxel labels are projected back to the point level via a de-voxelization step. Meanwhile, the 3D detection head and the auxiliary BEV segmentation head are attached to the 2D BEV branch. The second stage produces the refined semantic segmentation and the panoptic segmentation results.
  • Figure 3: Illustration of the Global Context Pooling (GCP) module. A 3D sparse tensor is projected to a 2D BEV feature map. Two levels of 2D BEV feature maps are concatenated and then converted back to a 3D sparse tensor, which serves as the input to the BEV task heads.
  • Figure 4: Illustration of the second-stage refinement pipeline. The architecture of the second-stage refinement is point-based. We first fuse the detected boxes, voxel-wise features, and BEV features from the first stage to generate the inputs for the second stage. A local coordinate transformation is applied to the points within each box. Then, a point-based backbone with MLPs, attention modules, and aggregation modules infers the box classification scores and point-wise mask scores. The final refined segmentation scores are computed by fusing the first-stage and second-stage predictions.
  • Figure 5: Examples of the second-stage refinement. The second stage improves the segmentation consistency of points belonging to thing objects.
  • ...and 2 more figures
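
The sparse-to-dense-and-back conversion described in the captions for Figures 2 and 3 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the sparse tensor is modeled as a dict from `(z, y, x)` voxel indices to feature lists, the height axis is stacked into the channel axis to form the BEV map, and `toy_2d_context` (which adds the map-wide mean to every cell) is a hypothetical stand-in for the 2D multi-scale feature extractor. All function names are invented for this sketch.

```python
def sparse_to_bev(sparse, grid, channels):
    """Densify {(z, y, x): features} and stack height into channels.

    Returns bev[y][x], a list of length D * channels (zeros for empty voxels).
    """
    depth, height, width = grid
    bev = [[[0.0] * (depth * channels) for _ in range(width)]
           for _ in range(height)]
    for (z, y, x), feat in sparse.items():
        bev[y][x][z * channels:(z + 1) * channels] = feat
    return bev


def toy_2d_context(bev):
    """Hypothetical stand-in for the 2D extractor: add the global mean per channel."""
    h, w, c = len(bev), len(bev[0]), len(bev[0][0])
    mean = [sum(bev[y][x][k] for y in range(h) for x in range(w)) / (h * w)
            for k in range(c)]
    return [[[bev[y][x][k] + mean[k] for k in range(c)] for x in range(w)]
            for y in range(h)]


def bev_to_sparse(bev, active_coords, channels):
    """Unstack channels back into height, keeping only the originally active voxels."""
    return {(z, y, x): bev[y][x][z * channels:(z + 1) * channels]
            for (z, y, x) in active_coords}


# Toy frame: a 2 x 2 x 2 grid (D, H, W) with one feature channel and two active voxels.
grid, channels = (2, 2, 2), 1
sparse = {(0, 0, 0): [1.0], (1, 1, 1): [3.0]}

bev = sparse_to_bev(sparse, grid, channels)
roundtrip = bev_to_sparse(bev, sparse.keys(), channels)
enriched = bev_to_sparse(toy_2d_context(bev), sparse.keys(), channels)
print(roundtrip)
print(enriched)
```

The point of the round trip is that each active voxel ends up with a feature that depends on the whole frame (here, through the global mean), which a purely local sparse convolution would not provide, while the output keeps the original sparsity pattern so the sparse decoder can consume it.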