SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View Perception
Jinsub Yim, Hyungtae Lee, Sungmin Eum, Yi-Ting Shen, Yan Zhang, Heesung Kwon, Shuvra S. Bhattacharyya
TL;DR
SynPlay addresses the challenge of locating humans in aerial views, where subjects are tiny, by introducing a rule-guided motion generation framework that combines real-world motion capture with motion evolution graphs, producing an effectively unbounded variety of dynamic behaviors. It uses a multi-perspective camera setup with UAVs, CCTVs, and a UGV to achieve near-to-far coverage, capturing 73,892 images with 6.5M annotated human instances across six games. Extensive experiments show substantial improvements in aerial-view detection and segmentation, especially in data-scarce and cross-domain scenarios, and FID analyses indicate high fidelity. The dataset’s combination of diverse motion, real-world-like interactions, and multi-view capture offers a new benchmark for long-range aerial perception and data-efficient learning.
Abstract
We introduce SynPlay, a large-scale synthetic human dataset purpose-built for advancing multi-perspective human localization, with a predominant focus on aerial-view perception. SynPlay departs from traditional synthetic datasets by addressing a critical but underexplored challenge: localizing humans in aerial scenes where subjects often occupy only tens of pixels in the image. In such scenarios, fine-grained details like facial features or textures become irrelevant, shifting the burden of recognition to human motion, behavior, and interactions. To meet this need, SynPlay implements a novel rule-guided motion generation framework that combines real-world motion capture with motion evolution graphs. This design enables human actions to evolve dynamically through high-level game rules rather than predefined scripts, resulting in effectively uncountable motion variations. Unlike existing synthetic datasets, which either focus on static visual traits or reuse a limited set of mocap-driven actions, SynPlay captures a wide spectrum of spontaneous behaviors, including complex interactions that naturally emerge from unscripted gameplay scenarios. SynPlay also introduces an extensive multi-camera setup that spans UAVs at random altitudes, CCTVs, and a freely roaming UGV, achieving true near-to-far perspective coverage in a single dataset. The majority of instances are captured from aerial viewpoints at varying scales, directly supporting the development of models for long-range human analysis, a setting where existing datasets fall short. Our data contains over 73k images and 6.5M human instances, with detailed annotations for detection, segmentation, and keypoint tasks. Extensive experiments demonstrate that training with SynPlay significantly improves human localization performance, especially in few-shot and data-scarce scenarios.
