Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity

Vasile Marian, Yong-Bin Kang, Alexander Buddery

TL;DR

This work tackles the challenge of predicting downstream object-detection gains from synthetic augmentation, addressing the gap where global generative metrics often fail to anticipate mAP improvements. It implements a controlled evaluation of synthetic augmentation for YOLOv11 across three regime-diverse datasets (Traffic Signs, Cityscapes Pedestrian, COCO PottedPlant) and six GAN-, diffusion-, and hybrid-based generators, over augmentation ratios from 10% to 150% of the real training split, with both from-scratch and COCO-pretrained initialization. The authors compare global embedding-based metrics (Inception-v3 and DINOv2) and object-centric bounding-box statistics using a matched-size bootstrap, and they compute augmentation-controlled residual correlations to isolate metric signal from the augmentation amount. The main findings show regime- and initialization-dependent gains, with the strongest effects in the more challenging regimes (Pedestrian and PottedPlant) and no universal pre-training metric that reliably predicts performance; however, augmentation-controlled metrics and object-centric diagnostics can provide dataset-specific guidance for prioritizing generators and budgets. This work offers practical, fixed-budget screening guidance for synthetic data selection while highlighting the limitations of universal metrics across diverse detection regimes.
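To make the "matched-size bootstrap" idea concrete, here is a minimal sketch of one plausible reading: resample the real and synthetic sets to the same size before computing a distribution distance, so that differing set sizes do not bias the comparison. The summary does not specify the paper's exact protocol; the choice of statistic (normalized box area), the 1-D Wasserstein distance, the function names, and the toy data are all assumptions made for illustration.

```python
import random
import statistics


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def matched_size_bootstrap(real, synth, n_boot=200, size=None, seed=0):
    """Mean and std of the distance over resamples of identical size."""
    rng = random.Random(seed)
    size = size or min(len(real), len(synth))  # assumed matching rule
    dists = []
    for _ in range(n_boot):
        r = rng.choices(real, k=size)   # sample with replacement
        s = rng.choices(synth, k=size)  # same size as the real resample
        dists.append(wasserstein_1d(r, s))
    return statistics.mean(dists), statistics.stdev(dists)


# Hypothetical normalized box areas; the synthetic set is larger and shifted.
gen = random.Random(1)
real_areas = [gen.uniform(0.01, 0.10) for _ in range(300)]
synth_areas = [gen.uniform(0.03, 0.12) for _ in range(900)]
mean_d, std_d = matched_size_bootstrap(real_areas, synth_areas)
print(f"W1 = {mean_d:.4f} +/- {std_d:.4f}")
```

Reporting the bootstrap spread alongside the mean gives a rough sense of whether two generators differ by more than resampling noise.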

Abstract

Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.50:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.
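The augmentation-controlled (residualized) correlation described above can be sketched as a partial Spearman correlation: rank all variables, regress the metric ranks and the mAP ranks on the augmentation-ratio ranks, and correlate the residuals. This is an illustrative sketch only; the abstract does not state how the authors residualize, so the linear fit on ranks, the function names, and the toy numbers below are assumptions.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def residuals(y, x):
    """Residuals of a least-squares line fit of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else float("nan")


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def residual_spearman(metric, score, ratio):
    """Spearman correlation of metric and score after removing the ratio trend."""
    rm, rs, rr = ranks(metric), ranks(score), ranks(ratio)
    return pearson(residuals(rm, rr), residuals(rs, rr))


# Toy data (hypothetical): both variables rise with the augmentation ratio,
# but within each ratio the higher metric goes with the lower mAP.
ratio = [10, 10, 50, 50, 100, 100, 150, 150]
metric = [1, 2, 3, 4, 5, 6, 7, 8]
map_5095 = [0.32, 0.31, 0.36, 0.35, 0.40, 0.39, 0.44, 0.43]
print("raw:", round(spearman(metric, map_5095), 3))
print("residual:", round(residual_spearman(metric, map_5095, ratio), 3))
```

In this toy case the raw correlation is strongly positive only because both variables track the augmentation ratio; once that shared trend is removed, the sign flips. Real results need not be this extreme, but the example shows why raw and controlled correlations can disagree.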

Paper Structure

This paper contains 51 sections, 4 equations, 13 figures, and 6 tables.

Figures (13)

  • Figure 1: Overview of the four-stage pipeline: dataset curation and baselines; synthetic generation and labeling; YOLO training and evaluation; metric-performance analysis.
  • Figure 2: Representative preprocessed samples from the three datasets used in this study: (a) Cityscapes Pedestrian, (b) Traffic Signs, and (c) COCO PottedPlant.
  • Figure 3: YOLOv11 from-scratch training-time validation mAP@0.50:0.95 vs. augmentation ratio. Each curve corresponds to one generator; the horizontal line indicates the real-only baseline. Curves report the best validation mAP observed during training (from Ultralytics training logs).
  • Figure 4: Residual Spearman correlations between synthetic-data metrics and YOLOv11 mAP@0.50:0.95 for the from-scratch regime, controlling for augmentation ratio. Asterisks denote BH-FDR corrected significance at $q<0.05$ (Benjamini and Hochberg, 1995).
  • Figure 5: Additional real samples from Cityscapes Pedestrian. Scenes are dense with frequent occlusion and multiple small objects.
  • ...and 8 more figures
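
The BH-FDR correction referenced in the Figure 4 caption is the standard Benjamini-Hochberg step-up procedure. A minimal sketch of how adjusted q-values can be computed from a list of p-values follows; the function name and the example p-values are hypothetical and are not taken from the paper.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is capped by the
    # q-values of all larger p-values (monotonicity) and by 1.0.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four metric-vs-mAP correlation tests.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = bh_adjust(pvals)
significant = [q < 0.05 for q in qvals]
print(qvals, significant)
```

A test is flagged when its q-value falls below the chosen threshold, which matches the $q<0.05$ convention used for the asterisks.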