Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion

Tomas Jakab, Ruining Li, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi

TL;DR

Farm3D learns category-level articulated 3D reconstruction from monocular images without any real training data. It trains on synthetic images generated by a pre-trained 2D diffusion model and extends Score Distillation Sampling (SDS) to supervise a monocular, category-specific 3D reconstructor that outputs controllable articulated meshes in a single forward pass. Key contributions include demonstrating effective category-level 3D learning from diffusion-generated data, introducing Animodel, a synthetic dataset for direct 3D evaluation, and enabling fast, editable 3D asset synthesis (lighting, texture, articulation) at test time. The approach reduces the data collection burden and yields 3D reconstruction and synthesis suitable for games and visualization, while remaining competitive with methods trained on real data.

Abstract

We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects, relying solely on "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn a monocular network that predicts the 3D shape, albedo, illumination, and viewpoint of any object occurrence, given a collection of single-view images of an object category. However, these approaches heavily rely on manually curated clean training data, which are expensive to obtain. We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data that are sufficiently clean and do not require further manual curation, enabling the learning of such a reconstruction network from scratch. Additionally, we incorporate the diffusion model as a score to enhance the learning process. The idea involves randomizing certain aspects of the reconstruction, such as viewpoint and illumination, generating virtual views of the reconstructed 3D object, and allowing the 2D network to assess the quality of the resulting image, thus providing feedback to the reconstructor. Unlike work based on distillation, which produces a single 3D asset for each textual prompt, our approach yields a monocular reconstruction network capable of outputting a controllable 3D asset from any given image, whether real or generated, in a single forward pass in a matter of seconds. Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.

Paper Structure (42 sections, 6 equations, 19 figures, 5 tables)


Figures (19)

  • Figure 1: Learning to Reconstruct 3D Animal Categories Purely from Synthetic Images. Our method learns to reconstruct articulated and textured animals from single images, using only virtual supervision from an off-the-shelf diffusion-based 2D image generator, without the curation of any real training images. The method generalizes to a wide range of animal categories, such as cows, horses, sheep, pigs, and dogs. Moreover, it can be used for controllable 3D synthesis. For instance, we can relight our generated 3D assets, swap their textures by conditioning on another input, and animate our articulated shapes, giving us more precise control over the generated assets.
  • Figure 2: Examples of typical unsuitable images from ImageNet.
  • Figure 3: Training Pipeline. We prompt Stable Diffusion for synthetic images of an object category that are then used to train a monocular articulated object reconstruction model that factorises the input image of an object instance into articulated shape, appearance (albedo and diffuse and ambient intensities), viewpoint, and light direction. During training, we also sample virtual instance views that are then "critiqued" by Stable Diffusion to guide the learning.
  • Figure 4: Synthetic Training Images Generated with Stable Diffusion. The generated animals are typically free of occlusions but sometimes anatomically incorrect (e.g., columns 2 and 3); our model is robust to such errors and still learns plausible 3D shapes.
  • Figure 5: Noise Scheduling and SDS Gradient. We show four rendered images $\hat{I}$, obtained by first sampling a random viewpoint and illumination as for training our model. Then, we pick a fixed noise sample $\epsilon$ and show the SDS gradient $(\hat{\epsilon}_t(z_t|y) - \epsilon) ({\partial h}/{\partial \hat{I}})$ used to update $\hat{I}$ in \ref{e:sds} for different values of $\sigma_t$. Because $\epsilon$ is fixed, $z_t = \alpha_t h(\hat{I}) + \sigma_t \epsilon$ only depends on $\sigma_t$. Large noise levels ($t=0.9$) generate an update which is essentially independent of the input image $\hat{I}$. Lower noise levels provide more meaningful gradients and lead to more stable training as shown in \ref{fig:ablation-sds_noise}.
  • ...and 14 more figures
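
The noising step in the Figure 5 caption, $z_t = \alpha_t h(\hat{I}) + \sigma_t \epsilon$, can be illustrated with a small sketch. This is not the authors' code. It assumes a variance-preserving schedule ($\alpha_t^2 + \sigma_t^2 = 1$), which the caption does not state, and it replaces the encoded renders $h(\hat{I})$ with random vectors of a hypothetical size. The helper names (`alpha_from_sigma`, `noised_latent`, `input_sensitivity`) are made up for illustration. The sketch noises two different inputs with the same fixed $\epsilon$ and reports how much of their original difference survives.

```python
import math
import random


def alpha_from_sigma(sigma_t):
    # Assumption: variance-preserving schedule, alpha_t^2 + sigma_t^2 = 1.
    return math.sqrt(1.0 - sigma_t ** 2)


def noised_latent(h, eps, sigma_t):
    # z_t = alpha_t * h(I_hat) + sigma_t * eps, applied element-wise.
    alpha_t = alpha_from_sigma(sigma_t)
    return [alpha_t * h_i + sigma_t * e_i for h_i, e_i in zip(h, eps)]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def input_sensitivity(h_a, h_b, eps, sigma_t):
    # Distance between two noised latents that share the same eps,
    # relative to the distance between the clean latents.
    z_a = noised_latent(h_a, eps, sigma_t)
    z_b = noised_latent(h_b, eps, sigma_t)
    return l2_distance(z_a, z_b) / l2_distance(h_a, h_b)


rng = random.Random(0)
dim = 16  # hypothetical latent size, for illustration only
h_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for h(I_hat) of render A
h_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for h(I_hat) of render B
eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # fixed noise sample

for sigma_t in (0.2, 0.6, 0.99):
    ratio = input_sensitivity(h_a, h_b, eps, sigma_t)
    print(f"sigma_t={sigma_t:.2f}  sensitivity={ratio:.3f}")
```

Under this assumed schedule the ratio equals $\alpha_t$, so it shrinks toward zero as $\sigma_t$ approaches 1. This is consistent with the caption's observation that high noise levels produce updates that barely depend on the rendered image.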