SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

TL;DR

The Synthetic Dataset Quality Metric (SDQM) assesses data quality for object detection tasks without requiring model training to converge. It also provides actionable insights for improving dataset quality, reducing the need for costly iterative training.

Abstract

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Correlation between each sub-metric and mAP50.
  • Figure 3: Validation datapoints with SDQM values calculated from random forest coefficients vs. YOLOv11n mAP50 scores.
  • Figure 4: Validation datapoints with SDQM values calculated from linear regression coefficients vs. YOLOv11n mAP50 scores.
  • Figure 5: Validation datapoints with SDQM values calculated from ridge regression coefficients vs. YOLOv11n mAP50 scores.
  • Figure 6: Validation datapoints with SDQM values calculated from XGBoost coefficients vs. YOLOv11n mAP50 scores.
  • ...and 2 more figures
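
The figure captions above describe SDQM values computed from regression coefficients and compared against YOLOv11n mAP50 scores on validation datapoints. A minimal sketch of that kind of evaluation is shown below, assuming a plain least-squares fit over a handful of sub-metric columns. It is not the authors' implementation: the function names (`fit_linear_weights`, `combined_score`, `pearson`), the number of sub-metrics, and all numeric values are hypothetical placeholders, and the data is randomly generated rather than taken from the paper.

```python
import numpy as np


def fit_linear_weights(sub_metrics, map50):
    """Least-squares weights and intercept mapping sub-metric columns to mAP50."""
    # Append a column of ones so the last coefficient acts as the intercept.
    design = np.column_stack([sub_metrics, np.ones(len(sub_metrics))])
    coef, *_ = np.linalg.lstsq(design, map50, rcond=None)
    return coef[:-1], coef[-1]


def combined_score(sub_metrics, weights, intercept):
    """Weighted sum of sub-metrics: one scalar quality score per dataset."""
    return sub_metrics @ weights + intercept


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Hypothetical placeholder data: 40 datasets, 3 made-up sub-metrics each.
rng = np.random.default_rng(0)
sub_metrics = rng.uniform(0.0, 1.0, size=(40, 3))
assumed_weights = np.array([0.5, 0.3, -0.2])  # assumption, not from the paper
map50 = sub_metrics @ assumed_weights + 0.1 + rng.normal(0.0, 0.02, size=40)

# Fit on the first 30 datasets, evaluate on the held-out 10.
train, val = slice(0, 30), slice(30, 40)
weights, intercept = fit_linear_weights(sub_metrics[train], map50[train])
scores = combined_score(sub_metrics[val], weights, intercept)
r = pearson(scores, map50[val])
print(f"validation Pearson r = {r:.3f}")
```

Holding out validation datapoints, as the captions suggest, keeps the reported correlation from simply reflecting the fit itself. Swapping the least-squares step for ridge regression or a tree-based model would change only `fit_linear_weights` and `combined_score`.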