Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang

TL;DR

Morphable Diffusion tackles the challenge of producing fully 3D-consistent, animatable human avatars from a single image. It unifies a 3D morphable model with a state-of-the-art multi-view diffusion backbone, conditioning the denoising process on a 3DMM-aware feature volume and CLIP-guided cues to enable both novel view synthesis and expression-driven animation. The approach introduces a shuffled training scheme and SparseConvNet-based 3D conditioning to preserve identity while allowing new facial expressions and poses for unseen subjects. Quantitative and qualitative results on FaceScape and THuman 2.0 show consistent improvements over strong baselines, with analysis highlighting the importance of 3D conditioning and dedicated training strategies. This work advances practical photorealistic avatar creation from minimal input and provides a path toward more controllable, animatable digital humans.

Abstract

Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available.

Paper Structure (16 sections, 10 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: Morphable diffusion. We introduce a morphable diffusion model to enable consistent, controllable novel view synthesis of humans from a single image. Given a single input image (a) and a morphable mesh model with a target facial expression (b), our method directly generates 3D-consistent and photorealistic images from novel viewpoints (c). Using the generated multi-view-consistent images, we can reconstruct a coarse 3D model (d) with off-the-shelf neural surface reconstruction methods such as NeuS2.
  • Figure 2: Morphable diffusion step. Overview of a single denoising step of the proposed 3D morphable diffusion pipeline. Our morphable denoiser takes as input a single image $\mathbf{y}$ and the underlying human model $\mathcal{M}$ and generates $N$ novel views from pre-defined viewpoints. The inputs to each step are the noisy images of the $N$ fixed views $\mathbf{x}_t^{(1:N)}$ obtained from the previous iteration, the camera projection matrices $P^{(1:N)} = (K^{(1:N)}, R^{(1:N)}, T^{(1:N)})$, and the target articulated 3D mesh model. (A) We construct a morphable noise volume by attaching the 2D noise features onto the mesh vertices; a SparseConvNet $f_\theta$ processes this volume to output a 3DMM-aware feature volume $\mathbf{F}_V$, which is then interpolated to the frustum $\mathbf{F}^{(i)}$ of the target view $(i)$ that we wish to synthesize. (B) The noisy target image $\mathbf{x}_t^{(i)}$, the input image $\mathbf{y}$, and the target feature frustum are then processed by a pre-trained 2D UNet akin to SyncDreamer liu2023syncdreamer to predict the denoised image for the next iteration, $\mathbf{x}_{t-1}^{(i)}$.
  • Figure 3: Single-view reconstruction of human faces. In addition to the single input view, our method also takes as input a mesh of the facial expression corresponding to the input image. Our method produces more plausible and realistic novel views than state-of-the-art methods. PixelNeRF yu2020pixelnerf and SSDNeRF ssdnerf provide multi-view consistency but produce overly blurry results. Zero-1-to-3 Liu_2023_ICCV generates images of good quality that nevertheless fail to preserve multi-view consistency and do not align with the ground-truth target views. SyncDreamer liu2023syncdreamer produces multi-view-consistent images with relatively accurate facial expressions, but they lose resemblance to the subject. For more details and discussion, please see the section on novel view synthesis.
  • Figure 4: Single-view reconstruction of human bodies. Our method is the only one that reconstructs the correct body poses. The relatively low resolution of all methods, however, limits the amount of detail in the generated images.
  • Figure 5: Novel facial expression synthesis. Qualitative comparison with DECA Feng:SIGGRAPH:2021, MoFaNeRF zhuang2022mofanerf, and DiffusionRig ding2023diffusionrig on novel facial expression synthesis. DiffusionRig is denoted with $^{*}$ since it requires per-subject finetuning with additional images. Our morphable diffusion model is the only one that successfully synthesizes novel views for a novel facial expression while retaining high fidelity. For more details and discussion, please see the section on novel expression synthesis.
  • ...and 9 more figures
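
The Figure 2 caption describes attaching 2D noise features onto mesh vertices using the camera projection matrices $P^{(i)} = (K^{(i)}, R^{(i)}, T^{(i)})$. The following is a minimal NumPy sketch of that one idea, not the authors' implementation. It assumes a pinhole camera with the world-to-camera convention $\mathbf{x}_{cam} = R\mathbf{x} + T$, nearest-neighbour feature lookup, and concatenation of the per-view features; the function names `project_vertices` and `attach_view_features` are hypothetical, and the SparseConvNet and UNet stages are omitted entirely.

```python
import numpy as np


def project_vertices(vertices, K, R, T):
    """Project (V, 3) world-space vertices with an assumed pinhole camera.

    Returns (V, 2) pixel coordinates and (V,) camera-space depths.
    """
    cam = vertices @ R.T + T      # world -> camera coordinates
    uvw = cam @ K.T               # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]


def attach_view_features(vertices, feature_maps, Ks, Rs, Ts):
    """Sample each view's 2D features at the projected vertex locations.

    feature_maps has shape (N, H, W, C). The result has shape (V, N * C):
    per-view features are concatenated (an assumption made for this sketch).
    """
    n_views, height, width, _ = feature_maps.shape
    per_view = []
    for i in range(n_views):
        uv, depth = project_vertices(vertices, Ks[i], Rs[i], Ts[i])
        # Nearest-neighbour lookup, clamped to the image bounds (assumption;
        # bilinear interpolation would be a smoother alternative).
        u = np.clip(np.rint(uv[:, 0]).astype(int), 0, width - 1)
        v = np.clip(np.rint(uv[:, 1]).astype(int), 0, height - 1)
        feats = feature_maps[i, v, u]                       # (V, C)
        # Zero out vertices that lie behind the camera.
        feats = np.where((depth > 0)[:, None], feats, 0.0)
        per_view.append(feats)
    return np.concatenate(per_view, axis=1)
```

Calling `attach_view_features(vertices, feature_maps, Ks, Rs, Ts)` with `N` stacked intrinsics, rotations, and translations yields one feature vector per vertex. In the pipeline described by the caption, such per-vertex features would then be processed by the SparseConvNet $f_\theta$ to form the feature volume $\mathbf{F}_V$.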