Exploring the Impact of Synthetic Data for Aerial-view Human Detection
Hyungtae Lee, Yan Zhang, Yi-Ting Shen, Heesung Kwon, Shuvra S. Bhattacharyya
TL;DR
This work investigates how synthetic data can improve aerial-view human detection by systematically analyzing three interacting factors: the real reference data used to measure the domain gap, the synthetic data selected for training, and the synthetic data pool from which samples are drawn. The authors model the detector's representation as a multivariate Gaussian in the feature space and define the distribution gap as a normalized Mahalanobis distance, enabling a quantitative link between domain discrepancy and post-training performance. They use Progressive Transformation Learning (PTL) to progressively augment the training set with synthetic samples while preserving sim2real quality, using a CycleGAN to adapt selected samples toward the current data distribution and a time-saving strategy that tunes from the previous iteration. Across extensive experiments on five real aerial datasets and a large synthetic pool, they show that synthetic data can significantly improve learning and generalization, especially in data-scarce regimes, but that the benefits depend on real-data availability, sim2real transformation quality, and the diversity and domain-gap characteristics of the synthetic pool. The study offers practical guidance for designing synthetic-data workflows that maximize learning gains and domain generalization in aerial perception tasks and beyond.
Abstract
Aerial-view human detection demands large-scale data because it must capture more diverse human appearances than ground-view human detection. Synthetic data is therefore a promising resource for expanding the training set, but its domain gap with real-world data is the biggest obstacle to using it in training. A common way to deal with the domain gap is the sim2real transformation, whose quality is affected by three factors: i) the real data that serves as a reference when calculating the domain gap, ii) the synthetic data chosen to avoid degrading transformation quality, and iii) the synthetic data pool from which the synthetic data is selected. In this paper, we investigate how these factors affect the effectiveness of synthetic data in training, in terms of improving learning performance and acquiring domain generalization ability, the two main benefits expected from using synthetic data. As an evaluation metric for the second benefit, we introduce a method for measuring the distribution gap between two datasets, derived as the normalized sum of the Mahalanobis distances of all test data. Our investigation yields several important findings that have either never been investigated or have previously been applied without an accurate understanding. We expect these findings to break the current trend of either naively using synthetic data or hesitating to use it in machine learning for lack of understanding, leading to more appropriate use in future research.
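To make the distribution-gap metric concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that feature vectors have already been extracted from a detector, that the reference dataset is modeled by a single multivariate Gaussian, and that "normalized" means dividing the summed Mahalanobis distances by the number of test samples. The function name `distribution_gap`, the regularization term `eps`, and the toy data are hypothetical choices made for illustration.

```python
import numpy as np


def distribution_gap(ref_feats, test_feats, eps=1e-6):
    """Mean Mahalanobis distance from test features to a Gaussian fit on ref features.

    ref_feats:  (n_ref, d) array of reference-set feature vectors.
    test_feats: (n_test, d) array of test-set feature vectors.
    eps:        small ridge added to the covariance for numerical stability
                (an assumption of this sketch, not taken from the paper).
    """
    mu = ref_feats.mean(axis=0)
    cov = np.cov(ref_feats, rowvar=False) + eps * np.eye(ref_feats.shape[1])
    cov_inv = np.linalg.inv(cov)

    diff = test_feats - mu
    # Squared Mahalanobis distance of every test sample: diff_i^T cov_inv diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    # Sum of distances normalized by the number of test samples.
    return float(np.sqrt(d2).sum() / len(test_feats))


# Toy features: one test set drawn from the reference distribution, one shifted.
rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 4))
near = rng.normal(size=(200, 4))
far = near + 5.0

gap_near = distribution_gap(ref, near)
gap_far = distribution_gap(ref, far)
```

Under these assumptions, a test set drawn from the same distribution as the reference set yields a smaller gap than a shifted one, which is the behavior expected of a measure of domain generalization.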
