SynthmanticLiDAR: A Synthetic Dataset for Semantic Segmentation on LiDAR Imaging
Javier Montalvo, Pablo Carballeira, Álvaro García-Martín
TL;DR
The paper presents SynthmanticLiDAR, a synthetic LiDAR semantic segmentation dataset generated with a modified CARLA simulator and designed to closely match SemanticKITTI in class definitions and class distribution. By pre-training segmentation models on SynthmanticLiDAR and fine-tuning them on SemanticKITTI, the authors demonstrate improved performance for the state-of-the-art methods SPVCNN and SqueezeSegV3, illustrating the value of synthetic data for reducing labeling costs and improving generalization. The LT subset further reveals a trade-off between underrepresented and well-represented classes, highlighting the need for balanced transfer learning. The dataset and accompanying tools are publicly released to enable further exploration of synthetic-to-real transfer and distribution-aware data generation in LiDAR perception.
Abstract
Semantic segmentation on LiDAR imaging is gaining increasing attention, as it can provide useful knowledge for perception systems and has potential for autonomous driving. However, collecting and labeling real LiDAR data is an expensive and time-consuming task. While datasets such as SemanticKITTI have been manually collected and labeled, the introduction of simulation tools such as CARLA has enabled the creation of synthetic datasets on demand. In this work, we present a modified CARLA simulator designed with LiDAR semantic segmentation in mind, with new classes, object labeling that is more consistent with that of real datasets such as SemanticKITTI, and the possibility of adjusting the object class distribution. Using this tool, we have generated SynthmanticLiDAR, a synthetic dataset for semantic segmentation on LiDAR imaging, designed to be similar to SemanticKITTI, and we evaluate its contribution to the training process of different semantic segmentation algorithms using a naive transfer learning approach. Our results show that incorporating SynthmanticLiDAR into the training process improves the overall performance of the tested algorithms, demonstrating the usefulness of our dataset and, therefore, of our adapted CARLA simulator. The dataset and simulator are available at https://github.com/vpulab/SynthmanticLiDAR.
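As an illustration of what matching the class distribution of a real dataset can mean in practice, the following minimal sketch compares per-class point frequencies between two label sets. It is not taken from the released tools: the function names, the toy labels, the class ids, and the choice of KL divergence as a similarity measure are all assumptions made for this example.

```python
from collections import Counter
from math import log


def class_distribution(labels, num_classes):
    """Return the fraction of points assigned to each class id."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), smoothed with eps so classes missing from q do not divide by zero."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy per-point labels with hypothetical class ids (0 = road, 1 = car, 2 = person).
real_labels = [0, 0, 0, 0, 1, 1, 1, 2]
synthetic_labels = [0, 0, 0, 0, 0, 1, 1, 2]

real_dist = class_distribution(real_labels, num_classes=3)
synth_dist = class_distribution(synthetic_labels, num_classes=3)

# A lower value means the synthetic class frequencies are closer to the real ones.
divergence = kl_divergence(real_dist, synth_dist)
print(real_dist, synth_dist, divergence)
```

A measure of this kind could be recomputed after each change to the simulated scene composition to check whether the synthetic distribution is moving toward the real one.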
