DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
TL;DR
DiffusionAvatars creates photorealistic, controllable 3D head avatars by combining a diffusion-based neural renderer with a neural parametric head model (NPHM) that serves as a geometric prior. The method rasterizes NPHM meshes from the target viewpoint, attaches learnable features to the head surface via TriPlane lookup, and conditions the diffusion model on both the geometry-derived inputs and, through cross-attention, the explicit NPHM expression codes. This enables accurate pose and expression transfer across novel views. The key contributions are the Deferred Diffusion framework, direct expression conditioning, and surface-rigged feature mapping, which together yield temporally consistent, high-fidelity renderings that outperform prior 3D and 2D diffusion baselines. The approach is evaluated on the NeRSemble dataset for self-reenactment and avatar animation, and it shows practical potential for VR/AR, teleconferencing, and character animation. Its noted limitations are lighting modeling and real-time applicability.
Abstract
DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose, we render a neural parametric head model (NPHM) from the target viewpoint, which acts as a proxy geometry of the person. Additionally, to enhance the modeling of intricate facial expressions, we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally, to synthesize consistent surface details across different viewpoints and expressions, we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and corresponding fitted NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person, outperforming existing approaches.
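To make two of the mechanisms above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It illustrates (1) a TriPlane lookup that maps canonical-space surface points to feature vectors and (2) a single-head cross-attention step in which image tokens attend to expression tokens. Every name here (`bilinear_sample`, `triplane_lookup`, `cross_attention`, the plane keys, the projection matrices) is hypothetical. The sketch also assumes that canonical coordinates are normalized to [-1, 1], that the three plane features are aggregated by summation, and that the expression code has already been turned into a small set of tokens; the abstract does not specify any of these details.

```python
import numpy as np


def bilinear_sample(plane, uv):
    """Bilinearly sample a feature plane of shape (H, W, C).

    uv has shape (N, 2) with coordinates assumed to lie in [-1, 1];
    uv[:, 0] indexes the width axis, uv[:, 1] the height axis.
    Returns an array of shape (N, C).
    """
    h, w, _ = plane.shape
    # Map [-1, 1] to continuous pixel coordinates and clamp to the plane.
    x = np.clip((uv[:, 0] + 1.0) * 0.5 * (w - 1), 0, w - 1)
    y = np.clip((uv[:, 1] + 1.0) * 0.5 * (h - 1), 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Interpolate along x on the two neighboring rows, then along y.
    top = plane[y0, x0] * (1.0 - wx) + plane[y0, x1] * wx
    bottom = plane[y1, x0] * (1.0 - wx) + plane[y1, x1] * wx
    return top * (1.0 - wy) + bottom * wy


def triplane_lookup(planes, points):
    """Look up features for canonical-space points of shape (N, 3).

    planes is a dict with keys "xy", "xz", "yz", each of shape (H, W, C).
    Each point is projected onto the three axis-aligned planes and the
    sampled features are summed (summation is an assumption of this sketch).
    """
    f_xy = bilinear_sample(planes["xy"], points[:, [0, 1]])
    f_xz = bilinear_sample(planes["xz"], points[:, [0, 2]])
    f_yz = bilinear_sample(planes["yz"], points[:, [1, 2]])
    return f_xy + f_xz + f_yz


def cross_attention(image_tokens, expr_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens query expression tokens.

    image_tokens: (T, d_img), expr_tokens: (S, d_expr),
    w_q: (d_img, d), w_k and w_v: (d_expr, d). Returns (T, d).
    """
    q = image_tokens @ w_q
    k = expr_tokens @ w_k
    v = expr_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the expression tokens.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    planes = {key: rng.normal(size=(32, 32, 8)) for key in ("xy", "xz", "yz")}
    surface_points = rng.uniform(-1.0, 1.0, size=(100, 3))
    features = triplane_lookup(planes, surface_points)
    print(features.shape)
```

Because the features are looked up in canonical space rather than in screen space, the same surface point receives the same feature vector regardless of viewpoint or expression, which is the intuition behind rigging the features to the head's surface.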
