License Plate Images Generation with Diffusion Models
Mariia Shpir, Nadiya Shvai, Amir Nakib
TL;DR
This work tackles the data scarcity that privacy regulations such as GDPR impose on license plate recognition (LPR) by training a diffusion model (DDPM) on Ukrainian license plates (LPs) to synthesize realistic LP images. The authors generate 1,000 synthetic plates for detailed manual analysis and release a 10,000-image synthetic Ukrainian LP dataset to support broader LPR research. They analyze readability, character and regional distributions, and failure types, and they evaluate the synthetic data on an LPR task: expanding the training set with pseudolabeled synthetic images improves accuracy by about 3% over the baseline. The findings indicate that diffusion-based data augmentation is practically viable for LPR in GDPR-constrained settings.
Abstract
Despite the evident practical importance of license plate recognition (LPR), research in this area is limited by the small volume of publicly available datasets, a consequence of privacy regulations such as the General Data Protection Regulation (GDPR). Synthetic data generation has emerged as a promising way to address this challenge. In this paper, we propose to synthesize realistic license plates (LPs) using diffusion models, inspired by recent advances in image and video generation. In our experiments, a diffusion model was successfully trained on a Ukrainian LP dataset, and 1,000 synthetic images were generated for detailed analysis. Through manual classification and annotation of the generated images, we performed a thorough study of the model output, including success rate, character distributions, and types of failure. Our contributions include experimental validation of the efficacy of diffusion models for LP synthesis, along with insights into the characteristics of the generated data. Furthermore, we have prepared a synthetic dataset of 10,000 LP images, publicly available at https://zenodo.org/doi/10.5281/zenodo.13342102. Our experiments empirically confirm the usefulness of synthetic data for the LPR task. Despite an initial performance gap between models trained on real data and on synthetic data, expanding the training dataset with pseudolabeled synthetic data improves LPR accuracy by 3% compared to the baseline.
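
The abstract does not describe how the pseudolabeling step is carried out. As a rough illustration only, the sketch below shows one common generic variant: a recognizer trained on real data labels the unlabeled synthetic images, and only confident predictions are added to the training set. The function names, the `recognize` callable, and the 0.9 confidence threshold are all assumptions made for this example, not details taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical type: (predicted plate text, confidence in [0, 1]).
Prediction = Tuple[str, float]


def pseudolabel(
    images: List[object],
    recognize: Callable[[object], Prediction],
    min_confidence: float = 0.9,  # assumed threshold, not from the paper
) -> List[Tuple[object, str]]:
    """Label unlabeled images with a recognizer, keeping confident predictions."""
    labeled = []
    for image in images:
        text, confidence = recognize(image)
        if confidence >= min_confidence:
            labeled.append((image, text))
    return labeled


def expand_training_set(real, synthetic_images, recognize, min_confidence=0.9):
    """Return the real labeled pairs followed by pseudolabeled synthetic pairs."""
    return list(real) + pseudolabel(synthetic_images, recognize, min_confidence)
```

In this sketch, a recognizer would then be retrained on the expanded set; whether the authors filter by confidence at all is not stated in the abstract.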
