
Visual Car Brand Classification by Implementing a Synthetic Image Dataset Creation Pipeline

Jan Lippemeier, Stefanie Hittmeyer, Oliver Niehörster, Markus Lange-Hegermann

TL;DR

This work tackles data scarcity in car-brand classification from real traffic footage by proposing an automated pipeline that generates labeled synthetic images with Stable Diffusion and validates them through YOLOv8 bounding-box detection. The method uses German vehicle-registration data to construct a balanced label distribution, generates images in Text-to-Image and Image-to-Image modes, crops to single-car regions, and trains a ResNet-18 classifier in a transfer-learning setup. Key findings show that a synthetic-data-only approach can achieve up to 75% accuracy on real-world tests when using a large, diverse synthetic dataset and combining diffusion modes, albeit with biases toward common brands and notable domain gaps. The approach promises rapid dataset creation without manual labeling, offering practical benefits for data-scarce computer vision tasks while highlighting areas for improvement in bias mitigation and generalization.
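One way to turn registration data into a label distribution can be sketched as follows. This is a minimal sketch, not the authors' procedure: it assumes "balanced" means every brand receives the same image budget, and that the budget is split across a brand's models in proportion to registration counts. The function name `allocate_images`, the brand and model names, and all counts are hypothetical placeholders, not real registration statistics.

```python
from typing import Dict


def allocate_images(
    registrations: Dict[str, Dict[str, int]], images_per_brand: int
) -> Dict[str, Dict[str, int]]:
    """Give every brand the same image budget and split it across the
    brand's models in proportion to their registration counts."""
    plan: Dict[str, Dict[str, int]] = {}
    for brand, models in registrations.items():
        total = sum(models.values())
        # Floor of each model's proportional share of the budget.
        shares = {m: images_per_brand * c // total for m, c in models.items()}
        # Hand out the images lost to rounding, most-registered models first.
        leftover = images_per_brand - sum(shares.values())
        for m in sorted(models, key=models.get, reverse=True)[:leftover]:
            shares[m] += 1
        plan[brand] = shares
    return plan


# Placeholder counts for illustration only (assumption, not real data).
registrations = {
    "BrandA": {"model_1": 600, "model_2": 300, "model_3": 100},
    "BrandB": {"model_1": 50, "model_2": 50},
}
plan = allocate_images(registrations, images_per_brand=10)
print(plan)
```

Each brand ends up with exactly `images_per_brand` images, so the classifier sees equally many examples per label while common models still dominate within a brand.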

Abstract

Recent advancements in machine learning, particularly in deep learning and object detection, have significantly improved performance in various tasks, including image classification and synthesis. However, challenges persist, particularly in acquiring labeled data that accurately represents specific use cases. In this work, we propose an automatic pipeline for generating synthetic image datasets using Stable Diffusion, an image synthesis model capable of producing highly realistic images. We leverage YOLOv8 for automatic bounding box detection and quality assessment of synthesized images. Our contributions include demonstrating the feasibility of training image classifiers solely on synthetic data, automating the image generation pipeline, and describing the computational requirements for our approach. We evaluate the usability of different modes of Stable Diffusion and achieve a classification accuracy of 75%.
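The quality check described above, keeping only generated images that contain a single car and cropping to its bounding box, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the `Detection` tuple layout, the `single_car_box` helper, and the 0.5 confidence threshold are hypothetical and do not reproduce the YOLOv8 API or the authors' settings. In practice the detections would come from the detector's output.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]     # (x_min, y_min, x_max, y_max) in pixels
Detection = Tuple[str, float, Box]  # (class name, confidence, box)


def single_car_box(
    detections: List[Detection], min_conf: float = 0.5
) -> Optional[Box]:
    """Return the crop box if exactly one sufficiently confident car is
    detected; return None so the image can be discarded otherwise."""
    cars = [d for d in detections if d[0] == "car" and d[1] >= min_conf]
    if len(cars) != 1:
        return None  # no car, or more than one car: reject the image
    return cars[0][2]


# Hypothetical detector outputs for three generated images.
one_car = [("car", 0.93, (40, 60, 480, 400))]
two_cars = [("car", 0.91, (10, 50, 250, 300)), ("car", 0.88, (260, 40, 500, 310))]
no_car = [("person", 0.80, (5, 5, 50, 120))]

for name, dets in [("one_car", one_car), ("two_cars", two_cars), ("no_car", no_car)]:
    print(name, single_car_box(dets))
```

Rejecting images instead of picking the largest box keeps the label unambiguous, at the cost of discarding some usable generations.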

Paper Structure

This paper contains 6 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The selected brands we aim to classify. These eight brands occur most frequently in our recorded footage.
  • Figure 2: The scheme of the developed pipeline consisting of dataset creation and training. The illustrated pipeline produces a dataset of synthetic images with corresponding labels and bounding boxes. The dataset can be used to automatically train an image classification model.
  • Figure 3: The prompt used to generate the images with Stable Diffusion alongside a generated image using Text-to-Image. The substring "gray Volkswagen Golf VII 2015" is changed accordingly for different car models.
  • Figure 4: Illustration of the differences between real images and the two modes of image generation. Text-to-Image tends to cover a wider range of perspectives, whereas Image-to-Image yields only a narrow range.
  • Figure 5: Bounding boxes detected with YOLOv8x. The boxes allow the image to be cropped and provide a confidence score for the presence of a car. They also make it possible to automatically sort out the two undesired images on the right, where more than one car is detected.
  • ...and 1 more figure
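
The prompt substitution described in the Figure 3 caption can be sketched as simple string templating. This is a minimal sketch: only the substring format (color, brand, model, year) is taken from the caption. The surrounding `PROMPT_TEMPLATE` text and the helper names are hypothetical, since the full prompt is not reproduced here.

```python
def car_descriptor(color: str, brand: str, model: str, year: int) -> str:
    """Build the variable substring, e.g. 'gray Volkswagen Golf VII 2015'."""
    return f"{color} {brand} {model} {year}"


# Hypothetical template text; only the {car} slot reflects the caption.
PROMPT_TEMPLATE = "a photo of a {car} driving on a road"


def build_prompt(color: str, brand: str, model: str, year: int) -> str:
    """Insert the car descriptor into the (assumed) prompt template."""
    return PROMPT_TEMPLATE.format(car=car_descriptor(color, brand, model, year))


descriptor = car_descriptor("gray", "Volkswagen", "Golf VII", 2015)
prompt = build_prompt("gray", "Volkswagen", "Golf VII", 2015)
print(descriptor)
print(prompt)
```

Because the brand is part of the descriptor, each generated image carries its class label from the start, which is what removes the need for manual labeling.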