DreamBeast: Distilling 3D Fantastical Animals with Part-Aware Knowledge Transfer

Runjia Li, Junlin Han, Luke Melas-Kyriazi, Chunyi Sun, Zhaochong An, Zhongrui Gui, Shuyang Sun, Philip Torr, Tomas Jakab

TL;DR

DreamBeast addresses part-level controllability in 3D asset generation by transferring part-aware knowledge from a strong 2D diffusion model into SDS. It extracts Part-Affinity maps from multi-view renderings, learns a Part-Affinity NeRF that interpolates those maps to arbitrary views, and uses the learned maps to modulate cross- and self-attention during SDS, producing 3D beasts with part-specific features. Compared with naive SD3-based pipelines, the approach yields higher part correspondence and image quality while dramatically reducing compute time, as shown by CLIP-based metrics and user studies. The work advances open-world 3D content creation by enabling flexible, part-aware composition of 3D assets with practical runtimes.

Abstract

We present DreamBeast, a novel method based on score distillation sampling (SDS) for generating fantastical 3D animal assets composed of distinct parts. Existing SDS methods often struggle with this generation task due to a limited understanding of part-level semantics in text-to-image diffusion models. While recent diffusion models, such as Stable Diffusion 3, demonstrate a better part-level understanding, they are prohibitively slow and exhibit other common problems associated with single-view diffusion models. DreamBeast overcomes this limitation through a novel part-aware knowledge transfer mechanism. For each generated asset, we efficiently extract part-level knowledge from the Stable Diffusion 3 model into a 3D Part-Affinity implicit representation. This enables us to instantly generate Part-Affinity maps from arbitrary camera views, which we then use to modulate the guidance of a multi-view diffusion model during SDS to create 3D assets of fantastical animals. DreamBeast significantly enhances the quality of generated 3D creatures with user-specified part compositions while reducing computational overhead, as demonstrated by extensive quantitative and qualitative evaluations.
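
The modulation step described above can be illustrated with a minimal sketch. This section does not give the paper's actual modulation rule, so the additive logit bias used here is an assumption chosen for illustration, and the names `modulate_cross_attention`, `affinity`, `part_token`, and `strength` are hypothetical. The idea shown is only that a per-pixel Part-Affinity value can raise the attention a pixel pays to the text token naming that part.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_cross_attention(logits, affinity, part_token, strength=2.0):
    """Bias cross-attention toward one part token (illustrative assumption).

    logits:     one row per pixel, one column per text token (pre-softmax).
    affinity:   per-pixel Part-Affinity value in [0, 1] for a single part.
    part_token: column index of the text token naming that part.
    strength:   hypothetical scale of the additive bias.
    """
    weights = []
    for pixel_logits, a in zip(logits, affinity):
        biased = list(pixel_logits)
        # Assumed rule: add a bias proportional to the affinity value.
        biased[part_token] += strength * a
        weights.append(softmax(biased))
    return weights


# Two pixels, three text tokens, uniform logits before modulation.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
affinity = [1.0, 0.0]  # pixel 0 lies on the part, pixel 1 does not
attn = modulate_cross_attention(logits, affinity, part_token=1)
```

In this toy case the pixel with affinity 1.0 attends more strongly to the part token, while the pixel with affinity 0.0 keeps its uniform attention.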

Paper Structure

This paper contains 31 sections, 3 equations, 17 figures, 4 tables, 1 algorithm.

Figures (17)

  • Figure 1: Generated fantastic 3D beasts composed of diverse animal parts. Our method enables part-level generation, resulting in 3D creatures with unique combinations of heads, limbs, wings, tails, and bodies.
  • Figure 2: Comparison of diffusion models on part-level prompt understanding in 2D generation. Although MVDream can grasp the overall semantics of the described animals, its generated images often feature deformed animals and, unlike those of SD3, fail to accurately capture specific part-based descriptions.
  • Figure 3: Failure to generate part-aware content despite part understanding in SD3. Although SD3 understands part correspondences, as evidenced by the cross-attention maps at certain timesteps $t$ and layers $l$, it may still fail to generate part-aware images. This is illustrated in the examples above, where specific animal parts are absent (highlighted in red). Our method capitalizes on the observation that only particular timesteps $t$ and layers $l$ exhibit part-awareness.
  • Figure 4: MVDream and SD3 have difficulty generating part-aware 3D animals. While SD3 can understand part correspondence in images and text, it struggles to generate 3D assets using SDS due to the issues discussed in the paper. MVDream falls short because it was fine-tuned on Objaverse, a dataset that lacks part-level information.
  • Figure 5: Running speed comparison. While DreamFusion (SD3) combined with MVDream and standalone DreamFusion (SD3) take 480 and 420 minutes respectively, our method reduces the runtime to 78 minutes. This reduction is achieved without sacrificing part-awareness, making our method both faster and more effective in part-aware 3D generation. More details are given in the appendix.
  • ...and 12 more figures
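
The runtimes quoted in the Figure 5 caption can be turned into speedup factors with a few lines of arithmetic. The sketch below uses only the three numbers stated in the caption. The dictionary layout and the `speedup` helper are illustrative names, not part of the paper.

```python
# Runtimes in minutes, as stated in the Figure 5 caption.
RUNTIME_MINUTES = {
    "DreamFusion (SD3) + MVDream": 480,
    "DreamFusion (SD3)": 420,
    "DreamBeast": 78,
}


def speedup(baseline_minutes, method_minutes):
    """How many times faster the method is than the baseline."""
    return baseline_minutes / method_minutes


ours = RUNTIME_MINUTES["DreamBeast"]
speedups = {
    name: round(speedup(minutes, ours), 2)
    for name, minutes in RUNTIME_MINUTES.items()
    if name != "DreamBeast"
}

for name, factor in speedups.items():
    print(f"DreamBeast is {factor}x faster than {name}")
```

Under these figures, the reported 78-minute runtime corresponds to roughly a 6.15x speedup over the combined pipeline and a 5.38x speedup over standalone DreamFusion (SD3).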