Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity
Vasile Marian, Yong-Bin Kang, Alexander Buddery
TL;DR
Predicting downstream object-detection gains from synthetic augmentation is difficult because global generative metrics (e.g., FID) often fail to anticipate mAP improvements. This work runs a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes (Traffic Signs, Cityscapes Pedestrian, COCO PottedPlant) and six generators spanning GAN, diffusion, and hybrid families, at augmentation ratios from 10% to 150% of the real training split, with both from-scratch and COCO-pretrained initialization. The authors compute global embedding-based metrics (Inception-v3 and DINOv2) and object-centric bounding-box statistics under a matched-size bootstrap, and use augmentation-controlled (residualized) correlations to separate metric signal from the amount of augmentation. Gains depend on regime and initialization: they are largest in the more challenging regimes and marginal in Traffic Signs and under pretrained fine-tuning. No single metric computed before training reliably predicts performance across regimes, although augmentation-controlled correlations and object-centric diagnostics can give dataset-specific guidance for prioritizing generators and budgets. The result is practical, fixed-budget screening guidance for synthetic data selection, together with a clear account of where universal metrics fall short.
Abstract
Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.50:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.
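To make the distinction between raw and augmentation-controlled correlations concrete, the following is a minimal sketch of one common way to residualize. It assumes a linear least-squares fit on the augmentation ratio and a Pearson correlation on the residuals; the abstract does not specify the regression form, the correlation statistic, or the multiple-testing procedure, so those choices, the function names, and the toy data are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def residualize(values, covariate):
    """Residuals of `values` after a linear least-squares fit on `covariate`.

    Assumption: a straight-line fit (intercept + slope) is enough to remove
    the effect of the augmentation ratio. The paper may use another form.
    """
    values = np.asarray(values, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def raw_and_controlled_corr(metric, map_scores, aug_ratio):
    """Return (raw, augmentation-controlled) Pearson correlations."""
    raw = np.corrcoef(metric, map_scores)[0, 1]
    controlled = np.corrcoef(
        residualize(metric, aug_ratio),
        residualize(map_scores, aug_ratio),
    )[0, 1]
    return raw, controlled


# Toy data (entirely synthetic, for illustration only): both the metric and
# mAP depend on the augmentation ratio, but not on each other.
rng = np.random.default_rng(0)
aug_ratio = np.tile([0.10, 0.25, 0.50, 1.00, 1.50], 40)
metric = 10.0 - 4.0 * aug_ratio + rng.normal(0.0, 0.5, aug_ratio.size)
map_scores = 0.30 + 0.10 * aug_ratio + rng.normal(0.0, 0.02, aug_ratio.size)

raw, controlled = raw_and_controlled_corr(metric, map_scores, aug_ratio)
print(f"raw r = {raw:.3f}, augmentation-controlled r = {controlled:.3f}")
```

In this toy setup the raw correlation is strongly negative only because both quantities track the augmentation ratio, and it largely disappears once that shared dependence is regressed out. This is the pattern the abstract describes when it says many raw associations weaken after controlling for augmentation level. A multiple-testing correction over the resulting p-values would be a separate step and is not shown here.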
