SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding

Dekai Zhu, Yixuan Hu, Youquan Liu, Dongyue Lu, Lingdong Kong, Slobodan Ilic

TL;DR

This paper tackles the scarcity of labeled LiDAR data for large-scale scene understanding by introducing Spiral, a semantic-aware diffusion model that jointly generates depth, reflectance, and semantic maps directly in the range-view domain. It contributes progressive semantic predictions, a closed-loop inference mechanism, and semantic-aware evaluation metrics, all aimed at cross-modal consistency and high-quality labeled outputs. Empirical results on SemanticKITTI and nuScenes show Spiral achieving state-of-the-art performance with a compact 61M-parameter model, and the generated range images prove effective for synthetic data augmentation in segmentation tasks. The work offers a practical path toward label-efficient 3D perception and sets new benchmarks for semantic-aware LiDAR generation and evaluation.

Abstract

Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
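
For readers unfamiliar with the range-view representation mentioned above, the following is a minimal sketch of the standard spherical projection that maps a LiDAR point cloud to a 2D depth image. It is not taken from the paper: the function name `project_to_range_image`, the 64 x 1024 resolution, and the vertical field of view of +3 to -25 degrees are illustrative assumptions, roughly matching a 64-beam sensor.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def project_to_range_image(
    points: Sequence[Point],
    height: int = 64,            # assumed number of rows (laser beams)
    width: int = 1024,           # assumed number of azimuth bins
    fov_up_deg: float = 3.0,     # assumed upper vertical field of view
    fov_down_deg: float = -25.0  # assumed lower vertical field of view
) -> List[List[float]]:
    """Project (x, y, z) points to a height x width depth image.

    Empty pixels hold 0.0. When several points fall into the same
    pixel, the nearest one is kept.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down

    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        depth = math.sqrt(x * x + y * y + z * z)
        if depth == 0.0:
            continue  # the sensor origin has no defined direction

        yaw = math.atan2(y, x)        # azimuth in [-pi, pi]
        pitch = math.asin(z / depth)  # elevation angle

        # Normalize the angles to pixel coordinates.
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height

        # Points outside the field of view are clamped to the border.
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))

        if image[row][col] == 0.0 or depth < image[row][col]:
            image[row][col] = depth
    return image
```

Reflectance and semantic labels can be rasterized with the same pixel coordinates, which is why the three modalities align pixel by pixel in the range view.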

Paper Structure

This paper contains 36 sections, 15 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Visualizations of LiDAR scenes and their semantic labels jointly generated by Spiral, exhibiting high geometric fidelity, semantic-geometric consistency, and broad applicability to downstream tasks in robotics and autonomous driving.
  • Figure 2: (a) Two-step methods: Existing range-view LiDAR generative models typically generate only depth and reflectance images, requiring an additional pre-trained segmentation model to predict semantic labels. (b) Spiral: In contrast, Spiral jointly generates depth, reflectance, and semantic maps. A closed-loop inference mechanism (indicated by the dashed arrow) further improves cross-modal consistency. (c) Results: Spiral achieves state-of-the-art performance with the smallest parameter size ($61$M) among the related methods.
  • Figure 3: (a) Unconditional Step: Spiral takes noisy LiDAR scenes $x_t$ as input and predicts both the semantic map $\hat{y}_t$ and the noise $\hat{\epsilon}_t$, where the switch $\mathcal{A}$ is off and $\mathcal{B}$ is on. (b) Conditional Step: Spiral predicts $\hat{\epsilon}_t$ conditioned on the given semantic map $y$, where $\mathcal{A}$ is on and $\mathcal{B}$ is off. (c) During inference, Spiral begins in an open-loop mode with unconditional steps. Once the predicted semantic map smoothed by the progressive filter reaches high confidence, Spiral switches to a closed-loop mode that alternates between unconditional and conditional steps, enhancing cross-modal consistency.
  • Figure 4: (a) Range-view based semantic-aware feature $f^s$ is constructed by concatenating the features extracted by the RangeNet++ [behley2019semantickitti] encoder and the LiDM [lidm] semantic encoder from the LiDAR scene $x$ and the semantic map $y$, respectively. (b) BEV-based semantic-aware feature $h^s$ is constructed by aggregating per-category 2D histograms.
  • Figure 5: Visualizations of generated LiDAR scenes on SemanticKITTI [behley2019semantickitti]. For two-step methods, we use the labels produced by RangeNet++ [milioto2019rangenet++] due to its superior performance over SPVCNN++ [liu2023uniseg]. Artifacts are highlighted with dashed boxes. Examples of semantic artifacts are shown in 7, 8, 9, and 11, while geometric artifacts such as local distortion and large noise are illustrated in 10 and 12.
  • ...and 5 more figures
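
The inference schedule described in the Figure 3 caption (open-loop unconditional steps first, then alternating unconditional and conditional steps once the smoothed semantic prediction is confident) can be illustrated with a small control-flow sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `plan_inference_steps`, the use of an exponential moving average as a stand-in for the progressive filter, a single scalar confidence per step, and the threshold value of 0.8.

```python
from typing import Iterable, List, Optional


def plan_inference_steps(
    confidences: Iterable[float],
    threshold: float = 0.8,  # assumed confidence needed to close the loop
    ema_decay: float = 0.5,  # assumed smoothing factor of the filter
) -> List[str]:
    """Return the step type chosen at each denoising step.

    `confidences[i]` is the confidence of the semantic map that an
    unconditional step at position i would predict. Conditional steps
    predict no semantic map, so their entry is ignored.
    """
    schedule: List[str] = []
    smoothed: Optional[float] = None
    closed_loop = False
    next_is_conditional = False

    for conf in confidences:
        if closed_loop and next_is_conditional:
            # Closed loop: denoise conditioned on the latest semantic map.
            schedule.append("conditional")
            next_is_conditional = False
            continue

        # Unconditional step: predicts noise and a semantic map.
        schedule.append("unconditional")
        if smoothed is None:
            smoothed = conf
        else:
            smoothed = ema_decay * smoothed + (1.0 - ema_decay) * conf

        # Switch to closed-loop mode once the smoothed confidence is high.
        if not closed_loop and smoothed >= threshold:
            closed_loop = True
        # In closed-loop mode every unconditional step is followed by a
        # conditional one.
        next_is_conditional = closed_loop

    return schedule
```

For example, with per-step confidences `[0.2, 0.4, 0.9, 0.9, 0.9, 0.9, 0.9]` the smoothed value first reaches the threshold at the fifth step, so the sketch yields five unconditional steps, then one conditional step, then an unconditional step again. In a real sampler each entry would correspond to one network call on the noisy scene $x_t$, with or without the semantic map as a condition.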