
Instant 3D Human Avatar Generation using Image Diffusion Models

Nikos Kolotouros, Thiemo Alldieck, Enric Corona, Eduard Gabriel Bazavan, Cristian Sminchisescu

TL;DR

AvatarPopUp introduces a fast, feed-forward pipeline for multimodal 3D human avatar generation by decoupling text/image-conditioned, diffusion-based front/back view synthesis from a 3D reconstruction stage. The method fine-tunes lightweight diffusion encoders with pose/shape conditioning and samples multiple back views to produce diverse, textured avatars within seconds, enabling interactive creation and animation. A PHORHUM-inspired, pixel-aligned 3D reconstruction network takes the front/back views and optional body controls and outputs a riggable textured mesh, with an accompanying animation pipeline that preserves identity across poses. Across text-to-3D generation, single-image reconstruction, and virtual try-on, AvatarPopUp achieves high-quality, diverse results with strong speed advantages over optimization-based baselines, opening scalable applications in entertainment, education, and design.

Abstract

We present AvatarPopUp, a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple the generation from the 3D modeling, which allows us to leverage powerful image synthesis priors trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning for image generation and back-view prediction, and to support multiple qualitatively different 3D hypotheses. Our partial fine-tuning approach allows us to adapt the networks for each task without inducing catastrophic forgetting. In our experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D model in as few as 2 seconds, a four-orders-of-magnitude speedup with respect to the vast majority of existing methods, most of which solve only a subset of our tasks and offer fewer controls. AvatarPopUp enables applications that require the controlled 3D generation of human avatars at scale. The project website can be found at https://www.nikoskolot.com/avatarpopup/.
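To put the speed claim in concrete terms, the short Python snippet below multiplies the two figures quoted in the abstract (a 2-second runtime and a speedup of four orders of magnitude). The resulting baseline runtime is only what those two numbers imply taken together; it is an illustrative back-of-the-envelope value, not a measured runtime of any specific method, and the variable names are chosen here for illustration.

```python
# Back-of-the-envelope check of the speed claim quoted in the abstract.
fast_runtime_s = 2.0   # "as few as 2 seconds" (from the abstract)
speedup = 10 ** 4      # "four orders of magnitude" (from the abstract)

# Runtime that a method slower by that factor would need (implied, not measured).
implied_baseline_s = fast_runtime_s * speedup
implied_baseline_h = implied_baseline_s / 3600

print(f"implied baseline: {implied_baseline_s:.0f} s (~{implied_baseline_h:.1f} h)")
```

In other words, a speedup of that size corresponds to baselines that run for hours per avatar rather than seconds.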

Paper Structure (29 sections, 4 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: We present AvatarPopUp, a new method for the automatic generation of 3D human assets. AvatarPopUp can generate rigged 3D models from text or from single images and has control over body pose and shape. In this example, we show 77 models generated from various text prompts in 12 minutes on a single GPU.
  • Figure 2: AvatarPopUp method. (Top) AvatarPopUp builds on the capacity of text-to-image models to generate highly detailed and diverse input images. First, a Latent Diffusion network takes a text prompt and a target body pose and shape $\mathcal{G}$, and generates a highly detailed front image $I_f$ of a person. Next, a second network generates a consistent back view $I_b$ in the same pose and clothing. (Bottom) We perform pixel-aligned 3D reconstruction given the generated front/back views $I_f, I_b$ and optionally the given 3D body pose and shape $\mathcal{G}$. This decoupling enables the generation of 3D avatars from text, images or a combination of the two.
  • Figure 3: Diverse back view hypotheses. Conditioned on the front view, our method is able to generate diverse plausible back views of the person, with different hairstyles, wrinkle patterns, or lighting. Our network can also be controlled with text (second row), to add fine-grained detail to our generated back-side views.
  • Figure 4: Diversity of our 3D generation. For the same text prompt and the same pose and shape conditioning, our model can generate a diverse set of 3D avatars that respect both the text and the 3D body controls.
  • Figure 5: Comparisons with text-to-3D human generation methods. Our method generates high quality results that respect the text prompt well, at a fraction of the others' runtime, cf. \ref{tab:numerical_eval_image}. TADA's results appear unnatural; DreamHuman failed for one subject and produces oversaturated colors; CHUPA failed to respect the prompt.
  • ...and 7 more figures
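
The two-stage flow described in the Figure 2 caption (front-view generation, back-view generation, then pixel-aligned 3D reconstruction) can be sketched in Python as plain data flow. This is a minimal sketch under stated assumptions, not the authors' implementation: every class and function name below (`BodyControl`, `View`, `Avatar`, `generate_front`, `generate_back`, `reconstruct`, `text_to_avatars`, `image_to_avatars`) is hypothetical, and the stubs return placeholder records instead of running diffusion or reconstruction networks. It only illustrates how decoupling lets the same reconstruction stage serve both text and image inputs, and how sampling several back views yields several 3D hypotheses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BodyControl:
    """Placeholder for the body pose and shape signal G (hypothetical fields)."""
    pose: tuple
    shape: tuple


@dataclass(frozen=True)
class View:
    """Placeholder for an image; a real system would hold pixels."""
    side: str
    content: str
    seed: int


@dataclass(frozen=True)
class Avatar:
    """Placeholder for the reconstructed textured mesh."""
    front: View
    back: View
    body: Optional[BodyControl]


def generate_front(prompt: str, body: BodyControl, seed: int) -> View:
    # Stage 1a (stub): text prompt + body control -> front image I_f.
    return View("front", f"{prompt} | pose={body.pose} shape={body.shape}", seed)


def generate_back(front: View, seed: int, prompt: Optional[str] = None) -> View:
    # Stage 1b (stub): front image (optionally plus text) -> back image I_b.
    hint = f" + text '{prompt}'" if prompt else ""
    return View("back", f"back of [{front.content}]{hint}", seed)


def reconstruct(front: View, back: View,
                body: Optional[BodyControl] = None) -> Avatar:
    # Stage 2 (stub): front/back views (+ optional body control) -> 3D avatar.
    return Avatar(front, back, body)


def text_to_avatars(prompt: str, body: BodyControl,
                    num_hypotheses: int = 3, seed: int = 0) -> List[Avatar]:
    # One front view, several sampled back views, one avatar per back view.
    front = generate_front(prompt, body, seed)
    backs = [generate_back(front, seed + 1 + i) for i in range(num_hypotheses)]
    return [reconstruct(front, back, body) for back in backs]


def image_to_avatars(front: View, num_hypotheses: int = 3,
                     seed: int = 0) -> List[Avatar]:
    # Single-image path: skip stage 1a and reuse the same later stages.
    backs = [generate_back(front, seed + i) for i in range(num_hypotheses)]
    return [reconstruct(front, back) for back in backs]
```

For example, `text_to_avatars("a firefighter", BodyControl(pose=(0.0,), shape=(0.0,)))` returns three placeholder avatars that share one front view and differ only in the sampled back view, mirroring the diverse back-view hypotheses described in Figure 3.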