MagicMirror: Fast and High-Quality Avatar Generation with a Constrained Search Space

Armand Comas-Massagué, Di Qiu, Menglei Chai, Marcel Bühler, Amit Raj, Ruiqi Gao, Qiangeng Xu, Mark Matthews, Paulo Gotardo, Octavia Camps, Sergio Orts-Escolano, Thabo Beeler

TL;DR

MagicMirror tackles fast, text-guided 3D avatar generation by constraining the search space with a conditional Neural Radiance Fields (NeRF) model trained on a large multi-view head dataset, and by introducing a geometry prior, learned through diffusion models, that produces accurate normal maps. Test-time optimization uses a Variational Score Distillation (VSD) objective that jointly refines appearance and geometry, mitigating the texture loss and over-saturation that affect methods based on Score Distillation Sampling (SDS). The framework supports both generic text-driven generation and subject-specific editing via DreamBooth-style personalization, and it reports better visual quality and identity adherence than recent baselines. It enables flexible, compositional editing across multiple prompts while keeping the process efficient, though it relies on substantial data and compute and raises privacy and alignment considerations for diffusion priors. Overall, MagicMirror represents a practical advance toward high-fidelity, user-friendly 3D avatar creation for gaming, AR/VR, and telepresence.

Abstract

We introduce a novel framework for 3D human avatar generation and personalization, leveraging text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges in photo-realistic avatar synthesis. First, we utilize a conditional Neural Radiance Fields (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Second, we develop a geometric prior, leveraging the capabilities of text-to-image diffusion models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by our optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As supported by our extensive experiments, these strategies collectively enable the creation of custom avatars with unparalleled visual quality and better adherence to input text prompts. You can find more results and videos on our website: https://syntec-research.github.io/MagicMirror
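
For readers unfamiliar with the distinction the abstract draws, the two distillation gradients can be written side by side. This is a sketch in the standard notation of the score-distillation literature, not necessarily the exact formulation used in MagicMirror. The symbols are assumptions introduced here: g(θ, c) is the image rendered from 3D parameters θ at camera c, x_t is its noised version at timestep t, y is the text prompt, w(t) is a timestep weighting, ε_pretrain is the frozen diffusion model's noise prediction, and ε_φ is a noise predictor fine-tuned on renders of the current 3D representation.

```latex
% Noised render (standard forward diffusion)
x_t = \alpha_t \, g(\theta, c) + \sigma_t \, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Score Distillation Sampling (SDS): the residual is taken against the sampled noise
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t, \epsilon, c}\!\left[
      w(t)\,\bigl(\epsilon_{\mathrm{pretrain}}(x_t, t, y) - \epsilon\bigr)\,
      \frac{\partial g(\theta, c)}{\partial \theta}
    \right]

% Variational Score Distillation (VSD): the sampled noise is replaced by a learned,
% camera-conditioned estimate of the score of the current renders
\nabla_\theta \mathcal{L}_{\mathrm{VSD}}(\theta)
  = \mathbb{E}_{t, \epsilon, c}\!\left[
      w(t)\,\bigl(\epsilon_{\mathrm{pretrain}}(x_t, t, y) - \epsilon_\phi(x_t, t, c, y)\bigr)\,
      \frac{\partial g(\theta, c)}{\partial \theta}
    \right]
```

The only difference between the two expressions is the subtracted term. SDS typically needs a large classifier-free guidance scale, which is commonly associated with over-saturated and over-smoothed textures. VSD compares the pretrained score against a learned score of the current renders and can operate at ordinary guidance scales, which is the usual explanation for the texture and saturation improvements the abstract mentions.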

Paper Structure (36 sections, 6 equations, 15 figures)

Figures (15)

  • Figure 1: We propose MagicMirror, a method for fast text-guided 3D avatar head generation, with the option of subject personalization. (left) Given pictures of a subject, MagicMirror can generate a 3D avatar with the subject's stylized appearance by following text descriptions. Avatars exhibit high quality in geometry and texture, with appearance significantly altered while preserving the identity of the subject. (right) It can also generate well-known characters from a text prompt alone.
  • Figure 2: Our two pipelines for 3D head avatar generation and customization follow the same structure: a pre-trained conditional NeRF model serves as a 3D prior for fast avatar generation. Our pipelines additionally leverage two pre-trained text-to-image diffusion models as texture and geometry priors, allowing for distillation-based customization of both these components based on input text prompts with state-of-the-art quality.
  • Figure 3: Our novel framework, MagicMirror, can successfully change facial expressions and features, and add accessories or specific styles to the person.
  • Figure 4: Text-to-image diffusion models have the remarkable ability to re-contextualize new concepts. We show the generated normal maps under new text prompts. Note that they are not rendered from a NeRF.
  • Figure 5: Ablation studies. We show that a geometric prior improves the results (top-left), even when the geometry prior comes from a different subject (center-left). (top-right) Our method yields very similar results in the personalized setting, even for very different NeRF initializations. A sufficiently diverse prior is required for convincing results (middle). (bottom-left) We demonstrate the effectiveness of VSD instead of SDS. (bottom-right) We show how inverting the latents works to a certain extent but fails for out-of-distribution cases.
  • ...and 10 more figures