DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

Tobias Kirschstein, Simon Giebenhain, Matthias Nießner

TL;DR

DiffusionAvatars addresses the challenge of creating photorealistic, controllable 3D head avatars by combining a diffusion-based neural renderer with a neural parametric head model (NPHM) that serves as a geometric prior. The method rasterizes NPHM meshes, attaches learnable surface features via TriPlanes, and conditions the diffusion model on both geometry-derived inputs and explicit expression codes through cross-attention, enabling accurate pose and expression transfer across novel views. Key contributions are the Deferred Diffusion framework, direct expression conditioning, and surface-rigged feature mapping, which together yield temporally consistent, high-fidelity renderings that outperform prior 3D and 2D diffusion baselines. The approach is evaluated on the NeRSemble dataset for self-reenactment and avatar animation and has practical potential for VR/AR, teleconferencing, and character animation. Its noted limitations are lighting modeling and real-time applicability.

Abstract

DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose, we render a neural parametric head model (NPHM) from the target viewpoint, which acts as a proxy geometry of the person. Additionally, to enhance the modeling of intricate facial expressions, we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally, to synthesize consistent surface details across different viewpoints and expressions, we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and corresponding fitted NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person, outperforming existing approaches.
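
The surface rigging mentioned in the abstract (spatial features looked up via TriPlanes in NPHM's canonical space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes three axis-aligned feature planes, canonical coordinates normalized to [-1, 1], bilinear interpolation within each plane, and summation of the three per-plane features. The names `bilinear`, `triplane_lookup`, and the plane keys `"xy"`, `"xz"`, `"yz"` are hypothetical.

```python
import math
import random


def bilinear(plane, u, v):
    """Bilinearly interpolate a feature plane at (u, v), both in [-1, 1].

    `plane` is a res x res grid indexed as plane[row][col], where each
    entry is a list of feature channels.
    """
    res = len(plane)
    # Map [-1, 1] to continuous grid coordinates [0, res - 1] and clamp.
    gx = min(max((u + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    gy = min(max((v + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    x0, y0 = int(math.floor(gx)), int(math.floor(gy))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    tx, ty = gx - x0, gy - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - tx) + plane[y0][x1][c] * tx
        bottom = plane[y1][x0][c] * (1 - tx) + plane[y1][x1][c] * tx
        out.append(top * (1 - ty) + bottom * ty)
    return out


def triplane_lookup(planes, x_can):
    """Return the feature vector for one canonical surface point.

    Assumption: per-plane features are summed; concatenation would be an
    equally plausible aggregation.
    """
    x, y, z = x_can
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


def random_plane(res, channels, rng):
    """Stand-in for a learnable plane: random values instead of trained ones."""
    return [[[rng.uniform(-1, 1) for _ in range(channels)]
             for _ in range(res)] for _ in range(res)]


if __name__ == "__main__":
    rng = random.Random(0)
    planes = {k: random_plane(4, 2, rng) for k in ("xy", "xz", "yz")}
    # A rasterized pixel stores the canonical coordinate of the surface point
    # it sees; the lookup depends only on that coordinate.
    print(triplane_lookup(planes, (0.2, -0.5, 0.7)))
```

Because the lookup depends only on the canonical coordinate, the same surface point receives the same feature regardless of viewpoint or expression, which is the property the abstract relies on for consistent surface details.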

Paper Structure (40 sections, 11 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Given a set of multi-view videos and corresponding fitted meshes, we build a DiffusionAvatar of a person. Our method translates expressions of a morphable model into realistic facial appearances of a person while also providing control over the viewpoint. Project website: https://tobias-kirschstein.github.io/diffusion-avatars/
  • Figure 2: Method overview: We decode an NPHM expression code $z_{exp}$ in two ways to obtain a realistic image. We first extract an NPHM mesh and rasterize it from the desired viewpoint in (a), giving us canonical coordinates, depths, and normal renderings for the head mesh. In (b), the canonical coordinates $x_{can}$ are used to look up spatial features in a TriPlanes structure, rigging the features onto the mesh surface. Together with the rasterizer output, these mapped features form the input for the ControlNet part of DiffusionAvatars. The second route for the expression code goes through a linear layer depicted in (c). It yields expression tokens that are subsequently used in a newly added cross-attention layer inside the pre-trained latent diffusion model. Intuitively, the rasterized inputs should encode pose, shape, and rough expression, while the direct expression conditioning hints at more detailed facial expressions. The final image is synthesized in (d) by iteratively denoising Gaussian noise using the original DDPM denoising schedule [ho2020ddpm].
  • Figure 3: Qualitative results for self-reenactment. We compare against 3D methods (NeRFace [gafni2021nerface], Mixture of Volumetric Primitives [lombardi2021mixture]) and methods employing 2D renderers (Deferred Neural Rendering [thies2019dnr], DiffusionRig [ding2023diffusionrig]). Note that the slightly gray background for DiffusionRig is caused by their training scheme using $\epsilon$-prediction (see the Diffusion section). Our method consistently produces more expressive facial performances while simultaneously providing more detailed renderings.
  • Figure 4: Qualitative results for avatar animation. Our method faithfully transfers the source actor's expression and consistently produces compelling renderings, even for complex performances.
  • Figure 5: Ablation of number of cameras on the Multiface dataset.
  • ...and 8 more figures
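
The direct expression conditioning in the Figure 2 caption, step (c), can be sketched as follows: a linear layer turns the expression code into a few tokens, and image features attend to those tokens via cross-attention. This is a simplified illustration under stated assumptions, not the paper's architecture: it uses a single attention head, omits the learned query/key/value projections, and adds the attended result back as a residual. All function names, the token count, and the dimensions are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """Compute W x + b for one vector x; `weight` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expression_tokens(z_exp, weight, bias, num_tokens, token_dim):
    """Map an expression code to `num_tokens` tokens of size `token_dim`."""
    flat = linear(z_exp, weight, bias)  # length num_tokens * token_dim
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(num_tokens)]


def cross_attention(queries, tokens):
    """Let each image feature (query) attend over the expression tokens.

    Simplification: tokens act as both keys and values, without learned
    projections. The attended vector is added to the query as a residual.
    """
    d = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        attended = [sum(w * tok[c] for w, tok in zip(weights, tokens))
                    for c in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    code_dim, num_tokens, token_dim = 8, 4, 6  # illustrative sizes only
    z_exp = [rng.gauss(0, 1) for _ in range(code_dim)]
    weight = [[rng.gauss(0, 0.1) for _ in range(code_dim)]
              for _ in range(num_tokens * token_dim)]
    bias = [0.0] * (num_tokens * token_dim)
    tokens = expression_tokens(z_exp, weight, bias, num_tokens, token_dim)
    # Three stand-in image features, e.g. three spatial locations of a latent.
    queries = [[rng.gauss(0, 1) for _ in range(token_dim)] for _ in range(3)]
    print(cross_attention(queries, tokens))
```

In this sketch the expression code reaches every spatial location of the image features, independently of what the rasterized mesh shows there, which matches the caption's intuition that direct conditioning can carry expression detail the coarse geometry misses.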