Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving

Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss

TL;DR

The paper tackles the bottleneck of annotating 3D semantic data for autonomous driving by introducing a latent diffusion framework built on a single sparse 3D VAE, without image projections or decoupled multi-resolution models. The VAE learns a descriptive latent space and uses pruning to capture hierarchical scene structure, while a latent DDPM generates new latents that are decoded back into dense, scene-scale semantic 3D scenes. The approach generates more realistic scenes than prior methods, and adding the generated scenes to the real training data improves 3D semantic segmentation performance, which shows its practical value for extending labeled datasets. The authors also analyze the gaps between real and generated data, identify class imbalance as a critical factor, and propose using a conditioned DDPM as a data annotator to generate data for specific scenarios.

Abstract

Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. However, there is still a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have recently been applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or on decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to provide more training data to extend existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Our scene generation pipeline: First, we train a VAE on the dense scenes $\mathcal{P}$ to reconstruct them as $\hat{\mathcal{P}}$ and to learn the sparse $\mathcal{Z}$ and dense $\boldsymbol{Z}$ latent spaces. Next, the DDPM $\theta$ is trained over the latent $\boldsymbol{Z}$: we sample a random step $t$, compute the noisy latent $\boldsymbol{Z}^t$, and train the model $\theta$ to predict $\mathbf{v}_\theta^t$, following the v-parameterization formulation [salimans2022iclr]. Finally, novel scenes are generated by sampling random noise $\boldsymbol{Z}^T \sim \mathcal{N}\left(\boldsymbol{0}, \boldsymbol{I}\right)$ and denoising it with $\theta$ over $T$ steps, arriving at $\boldsymbol{Z}^0 = \boldsymbol{Z}_\theta$, which the VAE decoder turns into the generated scene $\mathcal{P}' = \psi\left(\boldsymbol{Z}_\theta\right)$.
  • Figure 2: Diagram of the pruning process. The pruning layer predicts and prunes the unoccupied voxels before each upsampling layer, starting from the dense latent $\boldsymbol{Z}$.
  • Figure 3: Comparison of real and unconditioned generated scenes from different methods. PDD and SemCity scenes are limited to $0.2\,$m resolution. The baseline scenes show rounder, overly smooth shapes. Our method can generate more fine-grained details, closer to real data.
  • Figure 4: Simulated LiDAR point clouds from a dense generated scene.
  • Figure 5: Semantic segmentation model performance when trained with different percentages of real data complemented with synthetic data from our model and XCube [ren2024cvpr]. (a) Model trained with LiDAR scans simulated from the dense scenes. (b) Model trained with dense scenes.
  • ...and 3 more figures
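
The v-parameterization named in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the cosine schedule, the latent shape, and all function names are assumptions made for illustration, and the only property relied on is the variance-preserving identity $\alpha_t^2 + \sigma_t^2 = 1$. Under that identity, the training target is $\mathbf{v}^t = \alpha_t \boldsymbol{\epsilon} - \sigma_t \boldsymbol{Z}^0$, and both the clean latent and the noise can be recovered exactly from $\boldsymbol{Z}^t$ and $\mathbf{v}^t$.

```python
import numpy as np


def cosine_alpha_sigma(t, T):
    """Assumed variance-preserving schedule: alpha^2 + sigma^2 = 1 for every t."""
    phi = 0.5 * np.pi * t / T
    return np.cos(phi), np.sin(phi)


def add_noise(z0, eps, alpha, sigma):
    """Forward diffusion: noisy latent Z^t from the clean latent Z^0 and noise eps."""
    return alpha * z0 + sigma * eps


def v_target(z0, eps, alpha, sigma):
    """Regression target v^t that the denoiser would be trained to predict."""
    return alpha * eps - sigma * z0


def recover_from_v(zt, v, alpha, sigma):
    """Invert the parameterization: estimates of Z^0 and eps from Z^t and v^t."""
    z0_hat = alpha * zt - sigma * v
    eps_hat = sigma * zt + alpha * v
    return z0_hat, eps_hat


rng = np.random.default_rng(0)
# Hypothetical dense latent of shape (channels, x, y, z); the real shape is not given here.
z0 = rng.standard_normal((2, 4, 4, 4))
eps = rng.standard_normal(z0.shape)

alpha, sigma = cosine_alpha_sigma(t=300, T=1000)
zt = add_noise(z0, eps, alpha, sigma)
v = v_target(z0, eps, alpha, sigma)

# With a perfect prediction of v, the clean latent and the noise are recovered exactly.
z0_hat, eps_hat = recover_from_v(zt, v, alpha, sigma)
```

In a trained model, `v` would be replaced by the network prediction $\mathbf{v}_\theta^t$, and `z0_hat` would feed the next denoising step rather than match `z0` exactly.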