ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data

Elia Bonetto, Aamir Ahmad

TL;DR

The ZebraPose study tackles the scarcity of labeled wildlife data and the need for aerial viewpoints by training detection and 2D pose estimation models entirely on synthetic data generated from a photorealistic simulator. It introduces a unified pipeline that uses a cropped-and-scaled augmentation to create SC_5K for robust detection with YOLOv5s and extracts 27 2D keypoints from 3D zebra meshes to train ViTPose+ from scratch or with pretraining. Across extensive benchmarks on real-world and aerial zebra datasets, synthetic data alone achieves competitive performance and can surpass real-data baselines when combined with a small amount of real data, validating the syn-to-real generalization. The work highlights the practicality of scalable synthetic data for wildlife monitoring and aerial analysis, reducing reliance on costly real labeling and enabling broader, faster experimentation with new species and viewpoints.
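To make the keypoint-extraction step concrete, here is a minimal sketch of how 2D keypoint labels could be derived from 3D points on a mesh, assuming a simple pinhole camera. The function name `project_keypoints`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the in-frame visibility flag are illustrative assumptions; the summary does not describe the simulator's actual camera model or how occlusion is handled.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Keypoint2D = Tuple[float, float, bool]  # (u, v, in_frame)


def project_keypoints(
    points_cam: List[Point3D],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[Keypoint2D]:
    """Project camera-frame 3D points to pixel coordinates (pinhole model).

    Hypothetical helper: the flag only says whether the point lands inside
    the image; it does not test occlusion by the mesh or the scene.
    """
    keypoints = []
    for x, y, z in points_cam:
        if z <= 0:
            # Point is behind the camera: no valid projection.
            keypoints.append((0.0, 0.0, False))
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        in_frame = 0 <= u < width and 0 <= v < height
        keypoints.append((u, v, in_frame))
    return keypoints


if __name__ == "__main__":
    # Three made-up camera-frame points (metres), 640x480 image.
    pts = [(0.0, 0.0, 5.0), (1.0, 0.5, 2.0), (0.0, 0.0, -1.0)]
    for kp in project_keypoints(pts, 100.0, 100.0, 320.0, 240.0, 640, 480):
        print(kp)
```

In a real pipeline the 27 keypoints would be fixed vertices (or vertex groups) on the animated mesh, transformed into the camera frame before projection; that transform is omitted here.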

Abstract

Collecting and labeling large real-world wild animal datasets is impractical, costly, error-prone, and labor-intensive. For animal monitoring tasks, such as detection, tracking, and pose estimation, out-of-distribution viewpoints (e.g., aerial) are also typically needed but rarely found in publicly available datasets. To solve this, existing approaches synthesize data with simplistic techniques that then necessitate strategies to bridge the synthetic-to-real gap. Therefore, real images, style constraints, complex animal models, or pre-trained networks are often leveraged. In contrast, we generate a fully synthetic dataset using a 3D photorealistic simulator and demonstrate that it can eliminate such needs for detecting and estimating 2D poses of wild zebras. Moreover, existing top-down 2D pose estimation approaches using synthetic data assume reliable detection models. However, these often fail in out-of-distribution scenarios, e.g., those that include wildlife or aerial imagery. Our method overcomes this by enabling the training of both tasks using the same synthetic dataset. Through extensive benchmarks, we show that models trained from scratch exclusively on our synthetic data generalize well to real images. We run these benchmarks using multiple real-world and synthetic datasets, pre-trained and randomly initialized backbones, and different image resolutions. Code, results, models, and data can be found at https://zebrapose.is.tue.mpg.de/.

Paper Structure (11 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: A sample of our synthetic data. Zoomed inset: an individual with all the 27 keypoints labeled.
  • Figure 2: Examples of annotation errors in the APT-36K dataset.
  • Figure 3: An example of before [left] and after [right] the cropping and scaling procedure (see the detection augmentation section).
  • Figure 4: Cumulative Distributions of height and width ratio w.r.t. image size for SC and SC$_{\text{5K}}$ datasets.
  • Figure 5: YOLOv5s results on images from the APT-36K (top row) and our R123 (bottom row) datasets, using $1920\times1920$ resolution.
  • ...and 8 more figures
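The cropping and scaling procedure referenced in Figure 3 can be pictured with the following minimal sketch, which remaps bounding-box labels when an image region is cropped and resized. The function name `crop_and_scale_boxes`, the box format, and the choice to clip boxes to the crop and drop those falling outside are assumptions for illustration; the exact procedure used to build SC_5K is not given in this summary.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def crop_and_scale_boxes(
    boxes: List[Box],
    crop: Box,
    out_size: Tuple[int, int],
) -> List[Box]:
    """Remap boxes after cropping to `crop` and resizing to `out_size` (w, h).

    Hypothetical helper: boxes are clipped to the crop window, and boxes
    with no overlap are discarded.
    """
    x0, y0, x1, y1 = crop
    sx = out_size[0] / (x1 - x0)  # horizontal scale factor
    sy = out_size[1] / (y1 - y0)  # vertical scale factor
    remapped = []
    for bx0, by0, bx1, by1 in boxes:
        # Clip the box to the crop window.
        cx0, cy0 = max(bx0, x0), max(by0, y0)
        cx1, cy1 = min(bx1, x1), min(by1, y1)
        if cx1 <= cx0 or cy1 <= cy0:
            continue  # box lies entirely outside the crop
        # Shift to the crop origin, then scale to the output resolution.
        remapped.append(
            ((cx0 - x0) * sx, (cy0 - y0) * sy, (cx1 - x0) * sx, (cy1 - y0) * sy)
        )
    return remapped


if __name__ == "__main__":
    boxes = [(200, 150, 300, 250), (0, 0, 50, 50), (450, 450, 600, 600)]
    print(crop_and_scale_boxes(boxes, (100, 100, 500, 500), (800, 800)))
```

Cropping around the animals and scaling up makes each instance occupy a larger fraction of the frame, which is the kind of shift in height and width ratios that Figure 4 compares between SC and SC_5K.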