Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving

Lucas Nunes; Rodrigo Marcuzzi; Jens Behley; Cyrill Stachniss

TL;DR

The paper tackles the bottleneck of annotating 3D semantic data for autonomous driving by introducing a latent diffusion framework built on a single sparse 3D VAE, without relying on image projections or decoupled multi-resolution models. The VAE learns a descriptive latent space, using pruning to capture hierarchical scene structure, while a latent DDPM generates new scene-scale semantic samples that are decoded back into dense 3D scenes. The approach yields more realistic scenes than prior methods and, when the generated scenes are used as training data alongside real labeled data, improves 3D semantic segmentation performance, demonstrating practical utility for extending labeled datasets. The authors also analyze the gaps between real and generated data, emphasizing class imbalance as a critical factor, and propose a conditioned DDPM as a data annotator to enable targeted data generation for specific scenarios.

Abstract

Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome this data annotation limitation, simulated data has been used to generate annotated data on demand. However, there is still a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have recently been applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods rely either on image projection or on decoupled models trained at different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled multi-resolution models, achieving more realistic semantic scene data generation than previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to provide more training data to extend existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.
