Generative Zoo

Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas-Velasquez, Soubhik Sanyal, Michael J. Black, Silvia Zuffi, Peter Kulits

TL;DR

The paper addresses the scarcity and noisiness of labeled 3D animal pose and shape data with a pipeline that uses a language-conditioned image-generation model to produce images paired with ground-truth SMAL parameters. Using this pipeline, the authors build GenZoo, a million-image synthetic dataset, and train a 3D pose and shape regressor entirely on it, achieving state-of-the-art results on the real-world Animal3D benchmark without any real training images. Key contributions include the GenZoo dataset, the GenZoo-Felidae synthetic test set, and a controllable generation workflow combining SMAL, CLIP/AWOL shape sampling, BITE pose extraction, prompt synthesis, and ControlNet-guided diffusion. The work shows that synthetic data can support scalable, cross-species 3D animal pose and shape estimation, with potential impact on wildlife monitoring, behavior analysis, and veterinary applications.

Abstract

The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters. To demonstrate the scalability of our approach, we introduce GenZoo, a synthetic dataset containing one million images of distinct subjects. We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world animal pose and shape estimation benchmark, despite being trained solely on synthetic data. https://genzoo.is.tue.mpg.de

Paper Structure

This paper contains 23 sections, 1 equation, 10 figures, 3 tables.

Figures (10)

  • Figure 1: We propose a pipeline for the scalable generation of realistic 3D animal pose and shape estimation training data. Training solely on data produced using our pipeline, we achieve state-of-the-art performance on a real-world 3D pose and shape estimation benchmark.
  • Figure 2: Pipeline Overview. Starting with a sampled animal name (\ref{ssec:species}), we sample corresponding shape parameters (\ref{ssec:shape}). Paired pose parameters are sampled from a set of pseudo-poses (\ref{ssec:pose}). Sampled camera and scene descriptions are combined with a pose caption to form a prompt (\ref{ssec:prompt}). Rendered control signals and the prompt are used to guide the conditional image-generation model, resulting in the final image (\ref{ssec:generation}).
  • Figure 3: Animal3D Reconstruction Samples. We show the input image (top), GT mesh (middle), and our model's prediction (bottom).
  • Figure 4: Taxonomy. We sample species from a subset of the mammalian superorder Laurasiatheria. The figure displays the abbreviated taxonomic hierarchy of our sampling, where hyphens represent an empty level and the numbers give the count of contained species.
  • Figure 5: Qualitative Method Comparison. Predictions of our method compared with baseline results (*) sourced from Animal3D (Xu et al., 2023).
  • ...and 5 more figures
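
The sampling order described in the Figure 2 caption (species, then shape, then a pose with its caption, then a prompt) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the species list, pose bank, camera and scene phrases, the 4-dimensional shape vector, and all function names are hypothetical placeholders, not the authors' code or data. The final steps of the pipeline (rendering control signals and running the conditional image-generation model) are deliberately left out.

```python
import random
from dataclasses import dataclass

# Hypothetical placeholder data; the real pipeline samples from a taxonomy,
# a learned shape model, and a bank of extracted pseudo-poses.
SPECIES = ["lion", "horse", "dog"]
POSE_BANK = [([0.0, 0.0, 0.0], "standing"), ([0.1, -0.2, 0.3], "walking")]
CAMERAS = ["a close-up photo", "a wide-angle photo"]
SCENES = ["in a savanna", "on a snowy field"]


@dataclass
class Sample:
    species: str
    shape: list   # stand-in for shape parameters
    pose: list    # stand-in for pose parameters
    prompt: str   # text prompt for the image-generation model


def sample_shape(species, rng, dim=4):
    # Placeholder: a species-dependent offset plus noise, so that the shape
    # "corresponds" to the sampled animal name.
    offset = (sum(ord(c) for c in species) % 10) / 10.0
    return [offset + rng.gauss(0.0, 0.1) for _ in range(dim)]


def build_prompt(species, pose_caption, camera, scene):
    # Combine camera and scene descriptions with the pose caption.
    return f"{camera} of a {species} {pose_caption} {scene}"


def generate_sample(rng):
    species = rng.choice(SPECIES)            # 1. sample an animal name
    shape = sample_shape(species, rng)       # 2. sample matching shape parameters
    pose, caption = rng.choice(POSE_BANK)    # 3. sample a pose and its caption
    prompt = build_prompt(species, caption,  # 4. assemble the prompt
                          rng.choice(CAMERAS), rng.choice(SCENES))
    # 5. (omitted) render control signals from shape and pose, then run the
    #    conditional image-generation model with the prompt.
    return Sample(species, shape, pose, prompt)


if __name__ == "__main__":
    print(generate_sample(random.Random(0)))
```

Because the shape and pose parameters are sampled before any image exists, they can serve as ground-truth labels for whatever image the generation step produces.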