DreamHuman: Animatable 3D Avatars from Text

Nikos Kolotouros; Thiemo Alldieck; Andrei Zanfir; Eduard Gabriel Bazavan; Mihai Fieraru; Cristian Sminchisescu

TL;DR

DreamHuman presents a text-driven pipeline for animatable 3D human avatars that combines diffusion-guided synthesis, neural radiance fields, and the imGHUM body prior. By conditioning a deformable NeRF on pose and shape, and by using semantic zoom together with several regularizing losses, it produces high-fidelity, pose-aware clothing deformations without supervised text-to-3D data. The approach shows better geometry and texture quality than DreamFusion and AvatarCLIP, and supports diverse appearances and poses. Practical applications include tools for artists and synthetic-data generation; the authors also address ethical considerations and potential misuse.

Abstract

We present DreamHuman, a method to generate realistic animatable 3D human avatar models solely from textual descriptions. Recent text-to-3D methods have made considerable strides in generation, but are still lacking in important aspects. Control and often spatial resolution remain limited, existing methods produce fixed rather than animated 3D human models, and anthropometric consistency for complex structures like people remains a challenge. DreamHuman connects large text-to-image synthesis models, neural radiance fields, and statistical human body models in a novel modeling and optimization framework. This makes it possible to generate dynamic 3D human avatars with high-quality textures and learned, instance-specific surface deformations. We demonstrate that our method can generate a wide variety of animatable, realistic 3D human models from text. Our 3D models have diverse appearance, clothing, skin tones and body shapes, and significantly outperform both generic text-to-3D approaches and previous text-based 3D avatar generators in visual fidelity. For more results and animations, please see our website at https://dream-human.github.io.

Paper Structure (10 sections, 8 equations, 7 figures, 1 table)

Introduction
Related Work
Methodology
Architecture
Loss functions
Optimization
Experiments
Ablation Study
Comparison with the state of the art
Conclusion

Figures (7)

Figure 1: Example of 3D models synthesized and posed by our method. DreamHuman can produce an animatable 3D avatar given only a textual description of a human's appearance. At test time, our avatar can be reposed based on a set of 3D poses or a motion, without additional refinement.
Figure 2: 3D human avatars generated using our method given text prompts. We render each example in a random pose from two viewpoints, along with corresponding surface normal maps.
Figure 3: Overview of DreamHuman. Given a text prompt, such as "a woman wearing a dress", we generate a realistic, animatable 3D avatar whose appearance and body shape match the textual description. A key component in our pipeline is a deformable and pose-conditioned NeRF model learned and constrained using imGHUM [alldieck2021imghum], an implicit statistical 3D human pose and shape model. At each training step, we synthesize our avatar based on randomly sampled poses and render it from random viewpoints. The optimisation of the avatar structure is guided by the Score Distillation Sampling loss [poole2022dreamfusion] powered by a text-to-image generation model [saharia2022photorealistic]. We rely on imGHUM [alldieck2021imghum] to add pose control and inject anthropomorphic priors in the avatar optimisation process. We also use several other normal, mask and orientation-based losses in order to ensure coherent synthesis. NeRF, body shape, and spherical harmonics illumination parameters (in red) are optimised.
Figure 4: Importance of semantic zoom. For each example, the left image shows the avatar generated with semantic zoom, whereas the right image shows an avatar generated without it. Notice how semantic zoom allows us to reconstruct sharper, higher-quality textures.
Figure 5: Importance of pose-dependent deformations and pose sampling in the NeRF model, $f(\mathbf{\Phi} , d, \mathbf{s}, \boldsymbol{\theta}, \boldsymbol{\beta})$. Our non-rigid pose-dependent deformations enable more realistic clothing when reposing the avatar. For each of the two example prompts we show two generated avatars, with and without pose-correctives. Notice how the skirt and the shorts move more naturally when reposing the avatar.
...and 2 more figures
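
The Figure 3 caption says the avatar is optimised with the Score Distillation Sampling (SDS) loss. As a rough illustration only, the sketch below shows the shape of an SDS-style update on a toy problem: noise is added to the current "render", a noise predictor estimates that noise, and the difference between the estimate and the true noise is used as the gradient. Everything here is an assumption made for illustration: the names `add_noise`, `toy_noise_predictor`, `sds_gradient` and `optimise` are hypothetical, the predictor is a closed-form stand-in rather than a text-to-image model, and the pixels are optimised directly instead of back-propagating through a pose-conditioned NeRF. It is not the authors' implementation.

```python
import math
import random


def add_noise(x, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + s * ei for xi, ei in zip(x, eps)]


def toy_noise_predictor(x_t, alpha_bar, target):
    """Hypothetical stand-in for a text-conditioned diffusion model.

    Assumption: the "prompt" describes exactly one image (target), so the
    ideal noise estimate is (x_t - sqrt(alpha_bar) * target) / sqrt(1 - alpha_bar).
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * ti) / s for xt, ti in zip(x_t, target)]


def sds_gradient(x, target, alpha_bar, rng, weight=1.0):
    """SDS-style gradient w.r.t. the rendered image: weight * (eps_hat - eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = add_noise(x, eps, alpha_bar)
    eps_hat = toy_noise_predictor(x_t, alpha_bar, target)
    return [weight * (eh - e) for eh, e in zip(eps_hat, eps)]


def optimise(x, target, steps=300, lr=0.1, seed=0):
    """Gradient descent on the "render" itself (no NeRF, no renderer)."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        # Sample a noise level, analogous to sampling a diffusion timestep.
        alpha_bar = rng.uniform(0.1, 0.9)
        grad = sds_gradient(x, target, alpha_bar, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


# Toy usage: three "pixels" pulled towards the image the predictor prefers.
result = optimise([0.0, 0.0, 0.0], target=[0.2, 0.8, 0.5])
```

In a full pipeline of the kind the caption describes, this image-space gradient would be propagated through the differentiable renderer to the NeRF, body shape, and illumination parameters, with poses and viewpoints resampled at every step; the toy version skips all of that to keep the update rule visible.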
