Farm3D: Learning Articulated 3D Animals by Distilling 2D Diffusion
Tomas Jakab, Ruining Li, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi
TL;DR
Farm3D learns category-level articulated 3D reconstruction from monocular images without any real training images. It trains on synthetic images generated by a pre-trained 2D diffusion model and extends Score Distillation Sampling (SDS) to supervise a monocular, category-specific 3D reconstructor that outputs a controllable articulated mesh in a single forward pass. Key contributions are: showing that category-level 3D learning works from diffusion-generated data; introducing Animodel, a synthetic dataset for direct 3D evaluation; and enabling fast, editable 3D asset synthesis (lighting, texture, articulation) at test time. The approach removes the burden of collecting and curating real training data, stays competitive with methods trained on real data, and is practical for applications such as games and visualization.
Abstract
We present Farm3D, a method for learning category-specific 3D reconstructors for articulated objects, relying solely on "free" virtual supervision from a pre-trained 2D diffusion-based image generator. Recent approaches can learn a monocular network that predicts the 3D shape, albedo, illumination, and viewpoint of any object occurrence, given a collection of single-view images of an object category. However, these approaches heavily rely on manually curated clean training data, which are expensive to obtain. We propose a framework that uses an image generator, such as Stable Diffusion, to generate synthetic training data that are sufficiently clean and do not require further manual curation, enabling the learning of such a reconstruction network from scratch. Additionally, we incorporate the diffusion model as a score to enhance the learning process. The idea involves randomizing certain aspects of the reconstruction, such as viewpoint and illumination, generating virtual views of the reconstructed 3D object, and allowing the 2D network to assess the quality of the resulting image, thus providing feedback to the reconstructor. Unlike work based on distillation, which produces a single 3D asset for each textual prompt, our approach yields a monocular reconstruction network capable of outputting a controllable 3D asset from any given image, whether real or generated, in a single forward pass in a matter of seconds. Our network can be used for analysis, including monocular reconstruction, or for synthesis, generating articulated assets for real-time applications such as video games.
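The feedback loop described in the abstract (randomize illumination, render a virtual view, let a 2D model score the result) can be illustrated with a small toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the "renderer" only scales a per-pixel albedo vector by a scalar light, `toy_noise_predictor` is a hypothetical stand-in for a diffusion model whose data distribution is a single image `target`, the weighting w(t) is set to 1, and a parameter vector is optimised directly rather than the weights of a reconstruction network.

```python
import math
import random

def render(albedo, light):
    # Toy "renderer" (assumption): scale per-pixel albedo by a scalar light.
    return [a * light for a in albedo]

def toy_noise_predictor(x_t, alpha_bar, target):
    # Hypothetical stand-in for a diffusion model: the exact noise estimate
    # when the data distribution is concentrated on the single image `target`.
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - s * tg) / n for xt, tg in zip(x_t, target)]

def sds_gradient(image, alpha_bar, target, rng):
    # Add noise to the rendered image, query the predictor, and return
    # (eps_hat - eps): the SDS gradient w.r.t. the image, with w(t) = 1.
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in image]
    x_t = [s * x + n * e for x, e in zip(image, eps)]
    eps_hat = toy_noise_predictor(x_t, alpha_bar, target)
    return [eh - e for eh, e in zip(eps_hat, eps)]

def optimise(albedo, target, steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    albedo = list(albedo)
    for _ in range(steps):
        light = rng.uniform(0.9, 1.1)      # randomized illumination
        alpha_bar = rng.uniform(0.2, 0.8)  # randomized noise level
        image = render(albedo, light)
        grad_image = sds_gradient(image, alpha_bar, target, rng)
        # Chain rule through the toy renderer: d(image)/d(albedo) = light.
        albedo = [a - lr * g * light for a, g in zip(albedo, grad_image)]
    return albedo

def distance(a, b):
    # Euclidean distance between two equally sized vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Calling `optimise([0.0] * 4, [0.2, 0.5, 0.8, 0.4])` moves the albedo toward the image the toy predictor prefers. In the setting the abstract describes, the same kind of gradient would instead be backpropagated into the reconstruction network, so that a single forward pass produces the asset at test time.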
