YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals

Sandeep Mishra, Oindrila Saha, Alan C. Bovik

TL;DR

YouDream tackles the challenge of generating anatomically plausible 3D animals from text by introducing a 2D pose-conditioned diffusion approach guided by a 3D pose prior. The method combines a TetraPose ControlNet trained on tetrapod poses with a multi-agent LLM that maps animal names to feasible 3D poses from a compact pose library, plus a pose editor and a shape initializer to bootstrap NeRF. A diffusion-guided NeRF optimization (SDS) with 2D pose projections and scheduled control/guidance scales yields geometrically consistent and visually natural animals, including unreal creatures, without requiring 3D training data. Empirical results show YouDream outperforms baselines in naturalness and text-image alignment, supported by a user study and CLIP evaluations, and the pipeline supports automated generation of common animals as well as pose-editable unreal designs. The work contributes a practical, automated framework for anatomically coherent 3D animal generation guided by 3D pose priors, with broad implications for creative design and content creation in 3D, AR/VR, and gaming contexts.

Abstract

3D generation guided by text-to-image diffusion models enables the creation of visually compelling assets. However, previous methods explore generation based on image or text. The boundaries of creativity are limited by what can be expressed through words or the images that can be sourced. We present YouDream, a method to generate high-quality anatomically controllable animals. YouDream is guided using a text-to-image diffusion model controlled by 2D views of a 3D pose prior. Our method generates 3D animals that are not possible to create using previous text-to-3D generative methods. Additionally, our method is capable of preserving anatomic consistency in the generated animals, an area where prior text-to-3D approaches often struggle. Moreover, we design a fully automated pipeline for generating commonly found animals. To circumvent the need for human intervention to create a 3D pose, we propose a multi-agent LLM that adapts poses from a limited library of animal 3D poses to represent the desired animal. A user study conducted on the outcomes of YouDream demonstrates the preference of the animal models generated by our method over others. Turntable results and code are released at https://youdream3d.github.io/

Paper Structure

This paper contains 18 sections, 5 equations, 21 figures, 1 table.

Figures (21)

  • Figure 1: Creating unreal creatures. Our method generates imaginary creatures based on an artist's creative control. We show that these creatures cannot be generated faithfully based on text alone. Each row depicts a 3D animal generated by HiFA, MVDream, and YouDream (left to right) using the prompt mentioned below the row. We present the 3D pose controls used to create these in the supplementary implementation details section (results best viewed zoomed in).
  • Figure 2: Automatic pipeline for 3D animal generation. Given the name of an animal and textual pose description, we utilize a multi-agent LLM to generate a 3D pose ($\phi$) supported by a small library of animal names paired with 3D poses. With the obtained 3D pose, we train a NeRF to generate the 3D animal guided by a diffusion model controlled by 2D views ($\phi^{proj}$) of $\phi$.
  • Figure 3: Qualitative examples of pose editing using multi-agent LLM setup. For each example, the green box denotes the desired animal, while the blue box is the animal retrieved from the 3D pose library by Finder LLM ($\pi_{F}$). We show the pose modification performed by the joint effort of Observer ($\pi_{O}$) and Modifier ($\pi_{M}$) for three instances.
  • Figure 4: Comparison on generating animals observed in nature. We compare with baselines which use T2I diffusion (with official open-source code) for the automatic generation of text-to-3D animals. Unlike the baselines, our method produces high quality anatomically consistent animals.
  • Figure 5: User Study. User preferences on 1) Naturalness and 2) Text-Image alignment, averaged over 32 participants and 22 text-to-3D generated assets, reveal the superiority of our proposed method.
  • ...and 16 more figures
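
The Figure 2 caption describes controlling the diffusion model with 2D views ($\phi^{proj}$) of the 3D pose ($\phi$). The following is a minimal sketch of what such a projection step could look like, not the authors' implementation. It assumes a pinhole camera placed on a sphere around the origin and looking at it; the function names `orbit_camera` and `project_pose`, the focal length, and the image size are all hypothetical choices made for illustration.

```python
import math

def orbit_camera(azimuth_deg, elevation_deg, radius):
    """Camera on a sphere around the origin, looking at the origin.

    Returns (position, right, up, forward) as 3-tuples.
    Assumption: |elevation| < 90 degrees, world up is +y.
    """
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    pos = (radius * math.cos(el) * math.sin(az),
           radius * math.sin(el),
           radius * math.cos(el) * math.cos(az))
    fwd = tuple(-c / radius for c in pos)           # unit vector toward origin
    # right = normalize(cross(forward, world_up)) with world_up = (0, 1, 0)
    rx, rz = -fwd[2], fwd[0]
    n = math.hypot(rx, rz)
    right = (rx / n, 0.0, rz / n)
    # up = cross(right, forward)
    up = (right[1] * fwd[2] - right[2] * fwd[1],
          right[2] * fwd[0] - right[0] * fwd[2],
          right[0] * fwd[1] - right[1] * fwd[0])
    return pos, right, up, fwd

def project_pose(joints_3d, azimuth_deg, elevation_deg,
                 radius=2.0, focal_px=256.0, image_size=256):
    """Project 3D joints (phi) to 2D pixel coordinates (phi_proj) for one view.

    Joints behind the camera are returned as None.
    """
    pos, right, up, fwd = orbit_camera(azimuth_deg, elevation_deg, radius)
    c = image_size / 2.0
    out = []
    for p in joints_3d:
        d = tuple(p[i] - pos[i] for i in range(3))  # joint relative to camera
        x = sum(d[i] * right[i] for i in range(3))
        y = sum(d[i] * up[i] for i in range(3))
        z = sum(d[i] * fwd[i] for i in range(3))    # depth along view direction
        if z <= 0:
            out.append(None)
            continue
        # Perspective divide; image y axis points down
        out.append((c + focal_px * x / z, c - focal_px * y / z))
    return out

# Hypothetical toy "pose": three joints near the origin
toy_pose = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]
front_view = project_pose(toy_pose, azimuth_deg=0, elevation_deg=0)
side_view = project_pose(toy_pose, azimuth_deg=90, elevation_deg=0)
```

In a pipeline like the one the caption describes, each randomly sampled training camera would yield its own 2D keypoint layout, which could then be rendered as a skeleton image and passed to the pose-conditioned diffusion model as the control signal for that view.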