DivAvatar: Diverse 3D Avatar Generation with a Single Prompt
Weijing Tao, Biwen Lei, Kunhao Liu, Shijian Lu, Miaomiao Cui, Xuansong Xie, Chunyan Miao
TL;DR
DivAvatar tackles the challenge of producing diverse 3D avatars from a single text prompt by finetuning EVA3D with diffusion priors. It introduces three components: a noise sampling strategy that restores output diversity, a semantic-aware zoom that improves textual fidelity on complex prompts by conditioning on specific body regions, and a feature-based depth loss that regularizes and refines geometry. The method combines an unconditional 3D prior with SDS guidance and further refines the results through mesh optimization in a DMTet framework, enabling multiple posed avatars per prompt. The approach yields high-quality, multi-view avatars that are aligned with the input text and can be posed in varied ways, making it a practical tool for avatar generation in 3D pipelines, games, and AR/VR applications.
Abstract
Text-to-avatar generation has recently made significant strides thanks to advances in diffusion models. However, most existing work remains constrained by limited diversity, producing avatars that differ only subtly in appearance for a given text prompt. We design DivAvatar, a novel framework that generates diverse avatars, giving 3D creatives many distinct and richly varied 3D avatars from a single text prompt. Unlike most existing work, which relies on scene-specific 3D representations such as NeRF, DivAvatar finetunes a 3D generative model (i.e., EVA3D), so that diverse avatars can be generated simply by sampling noise at inference time. DivAvatar has two key designs that help achieve generation diversity and visual quality. The first is a noise sampling technique used during the training phase, which is critical for generating diverse appearances. The second combines a semantic-aware zoom mechanism with a novel depth loss: the former produces appearances of high textual fidelity by fine-tuning specific body parts separately, and the latter greatly improves geometry quality by smoothing the generated mesh in the feature space. Extensive experiments show that DivAvatar is highly versatile in generating avatars of diverse appearances.
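
The idea of obtaining diverse avatars "simply by sampling noise at inference time" can be illustrated with a toy sketch. This is not the DivAvatar or EVA3D implementation: the generator below is a hypothetical fixed linear map standing in for a finetuned 3D generative model, and the names `make_toy_generator`, `sample_latent`, `generate_avatars`, `LATENT_DIM`, and `FEATURE_DIM` are assumptions introduced only for illustration. The sketch shows just the sampling pattern: the model stays fixed, and each new latent code yields a different output.

```python
import random

LATENT_DIM = 8    # hypothetical latent size, chosen only for this sketch
FEATURE_DIM = 4   # hypothetical size of the toy "appearance" vector


def make_toy_generator(weight_seed=0):
    """Build a fixed linear map standing in for a finetuned 3D generator."""
    rng = random.Random(weight_seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
               for _ in range(FEATURE_DIM)]

    def generate(z):
        # Each output feature is a weighted sum of the latent entries.
        return [sum(w * x for w, x in zip(row, z)) for row in weights]

    return generate


def sample_latent(seed):
    """Draw one latent code z ~ N(0, I) from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def generate_avatars(generator, seeds):
    """Map each seed to one toy output; the generator weights stay fixed."""
    return [generator(sample_latent(s)) for s in seeds]


generator = make_toy_generator()
avatars = generate_avatars(generator, seeds=[1, 2, 3])
for seed, features in zip([1, 2, 3], avatars):
    print(seed, [round(f, 3) for f in features])
```

Seeding each latent separately keeps every sample reproducible, so a particular output can be regenerated later from its seed alone.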
