DivAvatar: Diverse 3D Avatar Generation with a Single Prompt

Weijing Tao, Biwen Lei, Kunhao Liu, Shijian Lu, Miaomiao Cui, Xuansong Xie, Chunyan Miao

TL;DR

DivAvatar tackles the challenge of producing diverse 3D avatars from a single text prompt by finetuning EVA3D with diffusion priors and introducing three novel components: strategic noise sampling to restore output diversity, semantic-aware zoom to improve textual fidelity across complex prompts, and a feature-based depth loss to refine geometry. The method combines an unconditional 3D prior with Score Distillation Sampling (SDS) guidance and further refines the results through mesh optimization in a DMTet framework, enabling multiple posed avatars per prompt. Key contributions include the noise-based diversification strategy, region-aware prompt conditioning, and depth-regularized geometry refinement, all supported by an end-to-end training and inference pipeline. The approach yields high-quality, multi-view avatars that align with the input text and can be rendered in varied poses, offering a practical tool for rapid avatar generation in 3D pipelines, games, and AR/VR applications.

Abstract

Text-to-Avatar generation has recently made significant strides due to advancements in diffusion models. However, most existing work remains constrained by limited diversity, producing avatars with only subtle differences in appearance for a given text prompt. We design DivAvatar, a novel framework that generates diverse avatars, empowering 3D creatives with a multitude of distinct and richly varied 3D avatars from a single text prompt. Unlike most existing work, which exploits scene-specific 3D representations such as NeRF, DivAvatar finetunes a 3D generative model (i.e., EVA3D), allowing diverse avatar generation by simply sampling noise at inference time. DivAvatar has two key designs that help achieve generation diversity and visual quality. The first is a noise sampling technique used during the training phase, which is critical for generating diverse appearances. The second is a semantic-aware zoom mechanism and a novel depth loss: the former produces appearances of high textual fidelity by separately fine-tuning specific body parts, and the latter greatly improves geometry quality by smoothing the generated mesh in feature space. Extensive experiments show that DivAvatar is highly versatile in generating avatars of diverse appearances.

Paper Structure (18 sections, 4 equations, 6 figures)

Figures (6)

  • Figure 1: Diverse generation of DivAvatar. Input text prompt: A woman wearing ski clothes. Given a single text prompt, DivAvatar can generate a varied set of appearances, while existing works generate appearances with only subtle differences. Furthermore, DivAvatar readily generates avatars in different sampled poses, while most existing works can only generate a pose that follows the input pose and require additional articulation for more varied poses.
  • Figure 2: Overview of DivAvatar. We generate a set of diverse avatars that align well with the input text prompt by using strategic noise sampling while finetuning the pre-trained human generative model (i.e., EVA3D). At each training iteration, the noise used as input is either randomly sampled or fixed. The optimisation of the avatar is guided by a Score Distillation Sampling loss with our semantic-aware zoom technique, which ensures higher textual fidelity, and a feature-based depth loss that smooths the geometry. At inference time, DivAvatar is able to generate numerous avatars of different appearances in various poses by sampling the noise and SMPL parameters.
  • Figure 3: We compare qualitative results for two different prompts against two existing methods. For each prompt, we obtain five different samples. Our method generates avatars that are more diverse in appearance and can be readily posed in natural poses. Input text prompt (top): A man wearing Christmas sweater. Input text prompt (bottom): A woman wearing denim.
  • Figure 4: Importance of our noise sampling. Input text prompt: A farmer. We show results across 5 inference samples. Without our noise sampling (i.e., p=1), the model generates avatars with little variation (top row). When we fix the noise the majority of the time (i.e., p=0.1), the generated avatars are much more varied. For each sample, we show the coarse avatar from the finetuned EVA3D (left) and the refined avatar after mesh optimization (right). Even though mesh optimization adds photorealistic texture details, the diversity stems from our noise sampling method.
  • Figure 5: Multiview results. Input text prompt of top row: A Black man wearing green tshirt. Input text prompt of bottom row: A man wearing Christmas sweater.
  • ...and 1 more figure
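
The training-time noise sampling described in the captions of Figures 2 and 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that p is the probability of drawing fresh random noise at a training iteration (a reading consistent with p=1 meaning "without our noise sampling" and p=0.1 meaning the noise is fixed most of the time), and that a single fixed latent vector is reused otherwise. The names make_noise_sampler, p_random, and latent_dim are hypothetical, as is the latent size used here.

```python
import random


def make_noise_sampler(p_random, latent_dim, seed=0):
    """Return a function that yields the latent noise for one training iteration.

    p_random:   probability of drawing fresh noise (assumed meaning of `p`).
    latent_dim: length of the latent vector (hypothetical, not from the paper).
    """
    rng = random.Random(seed)
    # Assumption: one fixed latent is drawn once and reused on "fixed" iterations.
    fixed_z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

    def sample():
        if rng.random() < p_random:
            # Fresh standard-normal noise for this iteration.
            return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)], "random"
        # Otherwise reuse a copy of the fixed latent.
        return list(fixed_z), "fixed"

    return sample


# With p_random=0.1 the fixed latent is used in the large majority of iterations.
sampler = make_noise_sampler(p_random=0.1, latent_dim=8)
draws = [sampler() for _ in range(1000)]
n_fixed = sum(1 for _, kind in draws if kind == "fixed")
print(f"fixed noise used in {n_fixed} of {len(draws)} iterations")
```

In an actual finetuning loop, the returned latent would be passed to the generator before computing the guidance loss; that part is omitted because the page does not give its details.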