
AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose

Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng

TL;DR

AvatarVerse tackles the challenge of generating high-fidelity, pose-controlled 3D avatars from natural language descriptions. It introduces a DensePose-conditioned ControlNet and a DensePose SDS loss to achieve pose-aware, view-consistent 3D synthesis, paired with progressive high-resolution generation and mesh refinement to elevate detail. Across qualitative results and user studies, it outperforms prior zero-shot avatar methods in geometry and texture fidelity. This work offers a practical, scalable pathway for expressive, animatable 3D avatars from text prompts and pose guidance.

Abstract

Creating expressive, diverse, and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, because 3D modeling and texturing must preserve fine detail across varied styles (realistic, fictional, etc.). We present AvatarVerse, a stable pipeline for generating expressive, high-quality 3D avatars from nothing but text descriptions and pose guidance. Specifically, we introduce a 2D diffusion model conditioned on DensePose signals to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stabilizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which substantially improves the quality of the created 3D avatars. As a result, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive, but also of higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io
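
For readers unfamiliar with score distillation sampling (SDS), the pose-conditioned guidance described above can be sketched as follows. This is the standard SDS gradient from DreamFusion with an added DensePose condition, not a quotation of the paper's own equations. The symbols are assumptions for illustration: $\theta$ denotes the parameters of the 3D representation, $x$ a rendered view, $y$ the text prompt, $c$ the DensePose map rendered from the same camera, $\epsilon_\phi$ the ControlNet-guided noise predictor, and $w(t)$ a timestep weighting.

```latex
% Assumed notation: standard SDS gradient plus a DensePose condition c.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon
```

Under this reading, $c$ is rendered from the same viewpoint as $x$, so the diffusion prior receives an explicit cue about which side of the body is visible. That is one plausible way pose conditioning can counter the Janus problem.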

Paper Structure (28 sections, 8 equations, 9 figures)

This paper contains 28 sections, 8 equations, 9 figures.

Figures (9)

  • Figure 1: High-quality 3D avatars generated by AvatarVerse based on a simple text description.
  • Figure 2: The overview of AvatarVerse. Our network takes a text prompt and DensePose signal as input to optimize an explicit NeRF via a DensePose-COCO pre-trained ControlNet. We use strategies including progressive grid, progressive radius, and focus mode to generate high-resolution and high-quality 3D avatars.
  • Figure 3: Qualitative results of our DensePose-conditioned ControlNet. (a) 10 generated images controlled by DensePose with varying viewpoints and body parts. (b) 10 corresponding images from the same viewpoints, controlled by human pose (OpenPose) signals. OpenPose control often fails to generate the back of the avatar (4th image in (b)) and struggles with partial-body generation (the last two columns). (c) Non-skin-tight generation results for both realistic and fictional avatars.
  • Figure 4: Qualitative comparisons with four SOTA methods. We show several non-cherry-picked results generated by AvatarVerse. Our method generates higher-resolution details and maintains finer-grained geometry than the other methods. (a): "Spiderman"; "a man wearing a white tanktop and shorts", (b): "Joker"; "a karate master wearing a Black belt", (c): "Stormtrooper"; "a Roman soldier wearing his armor".
  • Figure 5: Flexible Avatar Generation. (a) Partial Generation. All results are generated with the text prompts "Stormtrooper" and "Batman". (b) Arbitrary Pose Generation.
  • ...and 4 more figures