Exploring the Potential of Synthetic Data to Replace Real Data

Hyungtae Lee, Yan Zhang, Heesung Kwon, Shuvra S. Bhattacharyya

TL;DR

This work examines whether synthetic data can replace real data in cross-domain training for UAV-view human detection. It introduces two metrics, train2test distance and AP$_\text{t2t}$, to quantify how well a training set (augmented with synthetic data via syn2real transformations) represents test instances in relation to training performance, and it employs Progressive Transformation Learning (PTL) to curate synthetic samples. Across multiple UAV datasets, the study shows that synthetic data can significantly improve cross-domain training accuracy, especially for medium-confidence detections, but the magnitude of improvement depends on the test set because of differences in false-positive behavior. The findings offer guidance for designing synthetic data, emphasizing diversity in poses, occlusions, and backgrounds, so that it better substitutes for real data in cross-domain scenarios, and they motivate broader adoption of synthetic data strategies.

Abstract

The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_\text{t2t}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.

Paper Structure

This paper contains 9 sections, 4 equations, and 4 figures.

Figures (4)

  • Figure 1: The ability of synthetic data to replace real data. Each bar shows how much more (same-domain) real data can be replaced when synthetic data is used in training while maintaining the same detection performance. '# cross-domain image' indicates the number of cross-domain real images used in training, along with synthetic data. The details of measuring the ability of synthetic data are given in Sec. \ref{ssec:accuracy_match}.
  • Figure 2: Number of same-domain and cross-domain images providing equivalent training performance. The figures show the matching numbers both with and without synthetic data. The increases in matching numbers when using synthetic data are shown in Fig. 1. For all experiments, the average of three runs is reported to address random effects that may arise when choosing a specific number of real cross-domain training images.
  • Figure 3: Scaling behavior of real training data in terms of AP$_\text{t2t}$. Here, 'high', 'med', and 'all' represent the high-confidence detections, the above-medium-confidence detections, and all potential detections with minimum confidence score, respectively. The $y$-axis is shown in a logarithmic scale to better focus on the scaling behavior seen at low APs.
  • Figure 4: train2test distribution of true positives (TP) and false positives (FP). The top and bottom rows show the histograms for TP and FP, respectively. Each setting is labeled "XXX-N", where XXX is an abbreviation for the training dataset name and N is the number of real training images, e.g., Vis-50 refers to a training set containing 50 VisDrone images.
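
The "matching number" idea behind Figures 1 and 2 can be illustrated with a small sketch: given a curve of AP versus the number of same-domain real images, find the same-domain count that yields the same AP as a cross-domain training set, once without and once with synthetic data, and take the difference. Everything here is an assumption made for illustration: the function names, the log-linear interpolation, and the numbers in the usage example are hypothetical and are not taken from the paper, whose actual matching procedure is described in its own section on accuracy matching.

```python
import math


def matching_same_domain_count(same_counts, same_aps, target_ap):
    """Estimate how many same-domain images reach `target_ap`.

    Assumes `same_counts` and `same_aps` are sorted ascending and that AP
    does not decrease as images are added. Interpolation is linear in
    log(count); this is an illustrative choice, not the paper's procedure.
    Targets outside the measured AP range are clamped to the end points.
    """
    if target_ap <= same_aps[0]:
        return float(same_counts[0])
    if target_ap >= same_aps[-1]:
        return float(same_counts[-1])
    for i in range(1, len(same_aps)):
        lo_ap, hi_ap = same_aps[i - 1], same_aps[i]
        if lo_ap <= target_ap <= hi_ap:
            if hi_ap == lo_ap:  # flat segment: take its left end
                return float(same_counts[i - 1])
            frac = (target_ap - lo_ap) / (hi_ap - lo_ap)
            lo = math.log(same_counts[i - 1])
            hi = math.log(same_counts[i])
            return math.exp(lo + frac * (hi - lo))


def synthetic_gain(same_counts, same_aps, ap_without_synth, ap_with_synth):
    """Increase in the matching number when synthetic data is added."""
    without = matching_same_domain_count(same_counts, same_aps, ap_without_synth)
    with_synth = matching_same_domain_count(same_counts, same_aps, ap_with_synth)
    return with_synth - without


if __name__ == "__main__":
    # Hypothetical same-domain curve: AP measured at 10, 100, 1000 images.
    counts = [10, 100, 1000]
    aps = [0.10, 0.20, 0.30]
    # Hypothetical APs of one cross-domain training set, without/with synthetic data.
    print(synthetic_gain(counts, aps, ap_without_synth=0.10, ap_with_synth=0.20))
```

Interpolating in log(count) reflects the common observation that detection accuracy grows roughly with the logarithm of training-set size; a different curve fit would change the estimated matching numbers.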