ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data
Elia Bonetto, Aamir Ahmad
TL;DR
The ZebraPose study tackles the scarcity of labeled wildlife data and the need for aerial viewpoints by training detection and 2D pose estimation models entirely on synthetic data generated from a photorealistic simulator. It introduces a unified pipeline that uses a cropped-and-scaled augmentation to create SC_5K for robust detection with YOLOv5s and extracts 27 2D keypoints from 3D zebra meshes to train ViTPose+ from scratch or with pretraining. Across extensive benchmarks on real-world and aerial zebra datasets, synthetic data alone achieves competitive performance and can surpass real-data baselines when combined with a small amount of real data, validating the syn-to-real generalization. The work highlights the practicality of scalable synthetic data for wildlife monitoring and aerial analysis, reducing reliance on costly real labeling and enabling broader, faster experimentation with new species and viewpoints.
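As an illustration of how 2D keypoint labels can be derived from 3D mesh points, the following is a minimal sketch assuming a simple pinhole camera model. The source does not specify the projection used by the simulator, so the function name `project_keypoints`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the camera-frame convention (x right, y down, z forward) are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def project_keypoints(
    points_cam: Sequence[Tuple[float, float, float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[Tuple[float, float, bool]]:
    """Project 3D points (camera frame) to pixel coordinates.

    Hypothetical helper, not the authors' code. Assumes an ideal pinhole
    camera with no lens distortion. Returns (u, v, in_frame) per point,
    where in_frame only means "in front of the camera and inside the
    image"; occlusion by the mesh itself is not checked here.
    """
    projected = []
    for x, y, z in points_cam:
        if z <= 0:
            # Point is behind (or on) the camera plane: cannot be projected.
            projected.append((0.0, 0.0, False))
            continue
        u = fx * x / z + cx  # horizontal pixel coordinate
        v = fy * y / z + cy  # vertical pixel coordinate
        in_frame = 0 <= u < width and 0 <= v < height
        projected.append((u, v, in_frame))
    return projected


# Example with made-up intrinsics and three made-up 3D points.
example = project_keypoints(
    [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, -1.0)],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0, width=640, height=480,
)
```

In a real pipeline the same projection would be applied to each of the 27 annotated mesh vertices per zebra, and a separate visibility test (for example against a rendered depth map) would be needed to flag occluded keypoints.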
Abstract
Collecting and labeling large real-world wild animal datasets is impractical, costly, error-prone, and labor-intensive. For animal monitoring tasks, such as detection, tracking, and pose estimation, out-of-distribution viewpoints (e.g., aerial) are also typically needed but rarely found in publicly available datasets. To solve this, existing approaches synthesize data with simplistic techniques that then necessitate strategies to bridge the synthetic-to-real gap. Therefore, real images, style constraints, complex animal models, or pre-trained networks are often leveraged. In contrast, we generate a fully synthetic dataset using a 3D photorealistic simulator and demonstrate that it can eliminate such needs for detecting and estimating 2D poses of wild zebras. Moreover, existing top-down 2D pose estimation approaches using synthetic data assume reliable detection models. However, these often fail in out-of-distribution scenarios, e.g., those that include wildlife or aerial imagery. Our method overcomes this by enabling the training of both tasks using the same synthetic dataset. Through extensive benchmarks, we show that models trained from scratch exclusively on our synthetic data generalize well to real images. We run these benchmarks using multiple real-world and synthetic datasets, pre-trained and randomly initialized backbones, and different image resolutions. Code, results, models, and data can be found at https://zebrapose.is.tue.mpg.de/.
