Learning the 3D Fauna of the Web

Zizhang Li; Dor Litvak; Ruining Li; Yunzhi Zhang; Tomas Jakab; Christian Rupprecht; Shangzhe Wu; Andrea Vedaldi; Jiajun Wu

TL;DR

3D-Fauna tackles the ambitious task of learning a pan-category deformable 3D model for over 100 quadruped species using only 2D Internet images. It introduces the Semantic Bank of Skinned Models (SBSM), which automatically discovers a compact set of base shapes by leveraging self-supervised features, enabling category-agnostic shape priors. The model reconstructs articulated 3D meshes from a single image in a feed-forward pass and is trained on a new Fauna Dataset spanning 128 species, achieving strong qualitative and quantitative results with ablations validating the SBSM and a mask-based regularizer. This work significantly advances scalable 3D animal modeling from uncontrolled Internet data and paves the way toward comprehensive biodiversity reconstruction, albeit within a quadruped skeletal constraint and with some data curation requirements.

Abstract

Learning 3D models of all animals on the Earth requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck of modeling animals is the limited availability of training data, which we overcome by simply learning from 2D Internet images. We show that prior category-specific attempts fail to generalize to rare species with limited training images. We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM), which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model, we also contribute a new large-scale dataset of diverse animal species. At inference time, given a single image of any quadruped animal, our model reconstructs an articulated 3D mesh in a feed-forward fashion within seconds.
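The abstract describes the bank only at a high level, so the following is a minimal sketch of one way a feature vector could retrieve a base-shape code from a small learned bank. It assumes soft attention: cosine similarity between the image feature and a set of learned keys, passed through a softmax, then used to blend learned value vectors. The names `query_bank`, `keys`, `values` and `temperature`, the toy dimensions, and the attention rule itself are illustrative assumptions, not details confirmed by the text above.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_bank(phi, keys, values, temperature=0.1):
    """Blend the bank's value vectors by how well phi matches each key.

    Assumption: softmax over cosine similarity. The source text does not
    specify the query rule; this is one plausible choice.
    """
    weights = softmax([cosine(phi, k) / temperature for k in keys])
    dim = len(values[0])
    code = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return code, weights


# Toy bank: 4 slots, 8-dim keys (image-feature space), 3-dim values (shape codes).
rng = random.Random(0)
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
values = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]

# A query that is a slightly perturbed copy of key 2.
phi = [k + 0.05 * rng.gauss(0, 1) for k in keys[2]]
shape_code, weights = query_bank(phi, keys, values)
```

Because the query is a soft blend rather than a hard lookup, images of visually similar animals would retrieve similar codes. Under these assumptions, that is how a species with few training images could borrow shape information from better-represented ones.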

Paper Structure (49 sections, 7 equations, 15 figures, 5 tables)

Introduction
Related Work
Optimization-Based 3D Reconstruction of Animals.
Learning 3D from Internet Images and Videos.
Animal Datasets.
Method
Semantic Bank of Skinned Models
Skinned Models (SM).
Semantic Bank of Skinned Models.
Implementation Details.
Appearance.
Learning Formulation
Reconstruction Losses.
Correspondences from Self-Supervised Features.
Mask Discriminator.
...and 34 more sections

Figures (15)

Figure 1: Learning Diverse 3D Animals from the Internet. Our method, 3D-Fauna, learns a pan-category deformable 3D model of more than 100 different animal species using only 2D Internet images as training data. At test time, the model can turn a single image of a quadruped instance into an articulated, textured 3D mesh in a feed-forward manner, ready for animation and rendering.
Figure 2: Training Pipeline. 3D-Fauna is trained using only single-view images from the Internet. Given each input image, it first extracts a feature vector $\phi$ using a pre-trained unsupervised image encoder [caron2021emerging]. This is then used to query a learned memory bank to produce a base shape and a DINO feature field in the canonical pose. The model also predicts the albedo, instance-specific deformation, articulated pose and lighting, and is trained via image reconstruction losses on RGB, DINO feature map and mask, as well as a mask discriminator loss.
Figure 3: Queries from the Semantic Base Shape Bank. Without requiring any category labels, the Semantic Bank (see the section "Semantic Bank of Skinned Models") automatically learns diverse base shapes for various animals and preserves the semantic similarities across different instances.
Figure 4: Single Image 3D Reconstruction. Given a single image of any quadruped animal at test time, our model reconstructs an articulated and textured 3D mesh in a feed-forward manner without requiring category labels, which can be readily animated.
Figure 5: Qualitative Comparisons against MagicPony [wu2023magicpony], LASSIE [yao2022lassie], Hi-LASSIE [yao2023hi] and Zero-1-to-3 [liu2023zero]. Compared to all baselines, our method predicts more stable poses and higher-fidelity reconstructions. Note that our method is learning-based and predicts 3D meshes in a feed-forward fashion (as opposed to [yao2022lassie, yao2023hi], which optimize on test images), which is orders of magnitude faster.
...and 10 more figures
