SynPlay: Large-Scale Synthetic Human Data with Real-World Diversity for Aerial-View Perception

Jinsub Yim, Hyungtae Lee, Sungmin Eum, Yi-Ting Shen, Yan Zhang, Heesung Kwon, Shuvra S. Bhattacharyya

TL;DR

SynPlay addresses the challenge of locating humans in aerial views, where subjects are tiny, by introducing a rule-guided motion generation framework that blends real-world motion capture with motion evolution graphs, producing effectively uncountable dynamic behaviors. It deploys a multi-perspective setup with UAVs, CCTVs, and a UGV to achieve near-to-far coverage and captures 73,892 images with 6.5M annotated human instances across six games. Extensive experiments show substantial improvements in aerial-view detection and segmentation, especially in data-scarce and cross-domain scenarios, and FID analyses indicate high fidelity. The dataset's combination of diverse motion, real-world-like interactions, and multi-view capture offers a new benchmark for long-range aerial perception and data-efficient learning.

Abstract

We introduce SynPlay, a large-scale synthetic human dataset purpose-built for advancing multi-perspective human localization, with a predominant focus on aerial-view perception. SynPlay departs from traditional synthetic datasets by addressing a critical but underexplored challenge: localizing humans in aerial scenes where subjects often occupy only tens of pixels in the image. In such scenarios, fine-grained details like facial features or textures become irrelevant, shifting the burden of recognition to human motion, behavior, and interactions. To meet this need, SynPlay implements a novel rule-guided motion generation framework that combines real-world motion capture with motion evolution graphs. This design enables human actions to evolve dynamically through high-level game rules rather than predefined scripts, resulting in effectively uncountable motion variations. Unlike existing synthetic datasets, which either focus on static visual traits or reuse a limited set of mocap-driven actions, SynPlay captures a wide spectrum of spontaneous behaviors, including complex interactions that naturally emerge from unscripted gameplay scenarios. SynPlay also introduces an extensive multi-camera setup that spans UAVs at random altitudes, CCTVs, and a freely roaming UGV, achieving true near-to-far perspective coverage in a single dataset. The majority of instances are captured from aerial viewpoints at varying scales, directly supporting the development of models for long-range human analysis, a setting where existing datasets fall short. Our data contains over 73k images and 6.5M human instances, with detailed annotations for detection, segmentation, and keypoint tasks. Extensive experiments demonstrate that training with SynPlay significantly improves human localization performance, especially in few-shot and data-scarce scenarios.
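
To make the idea of a motion evolution graph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a graph whose nodes are elementary motion states and a pool of motion-capture clips per state, and it stands in for the game rules with a plain random walk from the start node to the end node. Apart from the tug-of-war "pull" state and the start/end nodes, every state name, edge, clip ID, and function name below is hypothetical.

```python
import random

# Hypothetical motion evolution graph for one tug-of-war player.
# Only "pull" and the start/end nodes come from the source; all other
# states and all edges are illustrative assumptions.
GRAPH = {
    "start": ["grip"],
    "grip": ["pull"],
    "pull": ["pull", "stumble", "end"],
    "stumble": ["pull", "end"],
}

# Hypothetical pool of mocap clip IDs for each elementary motion state.
CLIPS = {
    "grip": ["grip_a", "grip_b"],
    "pull": ["pull_a", "pull_b", "pull_c"],
    "stumble": ["stumble_a"],
}


def sample_sequence(rng, max_steps=20):
    """Random-walk the graph from 'start' toward 'end', one clip per state.

    The uniform random choice is a stand-in for rule-driven transitions;
    the walk stops early if 'end' is reached, or after max_steps states.
    """
    state, sequence = "start", []
    for _ in range(max_steps):
        state = rng.choice(GRAPH[state])
        if state == "end":
            break
        sequence.append((state, rng.choice(CLIPS[state])))
    return sequence


if __name__ == "__main__":
    rng = random.Random(0)
    for player in range(3):
        print(player, sample_sequence(rng))
```

Even in this toy graph, the self-loop on "pull" combined with several clips per state means the number of distinct sequences grows quickly with sequence length, which is the intuition behind the claim of effectively uncountable motion variations.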

Paper Structure (22 sections, 15 figures, 7 tables)

This paper contains 22 sections, 15 figures, 7 tables.

Figures (15)

  • Figure 1: SynPlay captures players in a virtual playground performing six traditional games that are also featured in the Netflix series "Squid Game" SquidGameNetflix2021. It promotes motion diversity via rule-guided motion generation, where actions evolve dynamically through game mechanics. A multi-perspective setup enables near-to-far viewpoint coverage while capturing diverse behaviors and appearances across angles.
  • Figure 2: Game sequence generation pipeline. This illustrates how we create a sequence for a tug-of-war game, including an example of how we incorporate real-world motions into the elementary motion state of pull. In the motion evolution graph, the start and end nodes are indicated by green and red circles, respectively. A diverse set of pull motion instances is shown below the image of the rendered scene.
  • Figure 3: Multiple viewpoints used in SynPlay. Multiple camera viewpoints allow substantial variations in appearance for the same human subject with identical pose.
  • Figure 4: Scaling behavior of synthetic datasets under the Vis-20 setup (AP$_\text{50}^\text{bb}$). The scaling behavior of each dataset is compared using randomly sampled subsets of 1,080, 4,320, and 17,280 images, which correspond to 1/16, 1/4, and the full size of Archangel, respectively. For reference, the sizes of Archange* and SynPlay are 34,994 and 73,892, respectively.
  • Figure 5: 456 virtual players in SynPlay created using Character Creator.
  • ...and 10 more figures