Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity

Vasile Marian; Yong-Bin Kang; Alexander Buddery

TL;DR

This work asks whether pre-training dataset metrics can predict the downstream object-detection gains from synthetic augmentation, given that global generative metrics often fail to anticipate mAP improvements. It presents a controlled evaluation of synthetic augmentation for YOLOv11 across three regime-diverse datasets (Traffic Signs, Cityscapes Pedestrian, COCO PottedPlant) and six generators, over augmentation ratios from 10% to 150% of the real training split, with both from-scratch and COCO-pretrained initialization. The authors compare global embedding-based metrics (Inception-v3 and DINOv2) with object-centric bounding-box statistics under a matched-size bootstrap, and they compute augmentation-controlled residual correlations to separate metric signal from the amount of augmentation. Gains depend on regime and initialization: they are largest in the more challenging regimes, and no single pre-training metric reliably predicts performance across all of them. Augmentation-controlled metrics and object-centric diagnostics can still give dataset-specific guidance for prioritizing generators and budgets. The work therefore offers practical, fixed-budget screening guidance for synthetic data selection while documenting the limits of universal metrics across diverse detection regimes.

Abstract

Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.50:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.
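The augmentation-controlled correlation described above can be illustrated with a minimal sketch. It assumes one common form of residualization, a partial rank correlation: rank-transform the metric, the mAP, and the augmentation ratio, regress the ranked augmentation ratio out of the other two, and correlate the residuals. The exact procedure used in the paper may differ, and all function names here (`average_ranks`, `residualize`, `residual_spearman`, `benjamini_hochberg`) are hypothetical. The Benjamini-Hochberg step takes p-values as given; how those p-values are obtained is not covered here.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1; tied values share their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def residualize(y, control):
    """Residuals of y after a least-squares linear fit on control (with intercept)."""
    design = np.column_stack([np.ones(len(control)), control])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


def residual_spearman(metric, score, aug_ratio):
    """Assumed form: Spearman correlation after regressing out the ranked augmentation ratio."""
    r_aug = average_ranks(aug_ratio)
    res_metric = residualize(average_ranks(metric), r_aug)
    res_score = residualize(average_ranks(score), r_aug)
    return float(np.corrcoef(res_metric, res_score)[0, 1])


def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure at level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest sorted index meeting its threshold
        reject[order[: k + 1]] = True
    return reject
```

In this sketch, each row would correspond to one dataset-generator-augmentation configuration, with `metric` a pre-training dataset metric, `score` the resulting mAP@0.50:0.95, and `aug_ratio` the augmentation level. A raw Spearman correlation that shrinks toward zero under `residual_spearman` suggests the association was driven mostly by how much synthetic data was added rather than by the metric itself.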

Paper Structure (51 sections, 4 equations, 13 figures, 6 tables)

Introduction
Related Work
Evaluation Framework: Synthetic Augmentation and Metric--Performance Analysis for YOLOv11
Results
YOLOv11 performance under synthetic augmentation
Metric--performance alignment
From-Scratch: metric signal is dataset-dependent.
Discussion and Conclusions
Supplementary dataset examples and regime statistics
Dataset curation and preprocessing details
YOLOv11 training details
Additional real dataset examples
Dataset regime statistics (real training splits)
Interpretation (regime differences).
Additional results and diagnostics
...and 36 more sections

Figures (13)

Figure 1: Overview of the four-stage pipeline: dataset curation & baselines; synthetic generation & labeling; YOLO training/evaluation; metric--performance analysis.
Figure 2: Representative preprocessed samples from the three datasets used in this study: (a) Cityscapes Pedestrian, (b) Traffic Signs, and (c) COCO PottedPlant.
Figure 3: YOLOv11 From-scratch training-time validation mAP@0.50:0.95 vs. augmentation ratio. Each curve corresponds to one generator; the horizontal line indicates the real-only baseline. Curves report the best validation mAP observed during training (from Ultralytics training logs).
Figure 4: Residual Spearman correlations between synthetic-data metrics and YOLOv11 mAP@0.50:0.95 for the From-Scratch regime, controlling for augmentation ratio. Asterisks denote BH--FDR corrected significance at $q<0.05$ (Benjamini and Hochberg, 1995).
Figure 5: Additional real samples from Cityscapes Pedestrian. Scenes are dense with frequent occlusion and multiple small objects.
...and 8 more figures
