GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction

Chao Xu, Xiaochen Zhao, Xiang Deng, Jingxiang Sun, Zhuo Su, Donglin Di, Yebin Liu

TL;DR

This work proposes a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction, and substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.

Abstract

Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.

Paper Structure

This paper contains 50 sections, 2 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: We present GeoDiff4D, a framework that reconstructs animatable 4D head avatars from a single portrait image through geometry-aware diffusion. By jointly predicting portrait image frames and surface normals with a pose-free expression encoder, our method trains 3D Gaussians under dual supervision, achieving exceptional identity preservation and 3D consistency.
  • Figure 2: Overall architecture. Our system takes a reference image, driving expressions, and head poses as input. Specifically, the reference image is encoded into hierarchical identity embeddings using a pretrained VAE and UNet-based reference network. Driving expressions are compressed into low-dimensional latents via a pose-free expression encoder. Both embeddings are injected into the diffusion model through cross-attention, while head pose maps concatenated with noise serve as inputs. The model then jointly predicts portrait images and surface normals. For 3D reconstruction, a UNet refines FLAME meshes using expression latents through cross-attention, and an MLP captures Gaussian dynamics. Finally, the generated surface normals provide additional geometric supervision that further enhances the reconstruction fidelity.
  • Figure 3: Cross-view pairing training strategy. For each identity and timestep, frames from different viewpoints are paired with consistent expressions but varying poses, enabling the encoder to learn view-invariant representations.
  • Figure 4: Self-Reenactment results. We conduct this experiment on a subset of the NeRSemblev2 dataset containing a large number of extreme head poses in both the reference images and driving sequences, enabling a comprehensive evaluation of model performance. We also show surface normals generated by our video generation model.
  • Figure 5: Cross-Reenactment results. Evaluated on a mixture of NeRSemblev2 and in-the-wild collections spanning diverse real and cartoon identities to assess model generalizability.
  • ...and 8 more figures
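
The cross-view pairing described in the Figure 3 caption can be illustrated with a small data-preparation sketch. This is not the authors' code: the `Frame` record, its field names, the file paths, and the `cross_view_pairs` helper are all hypothetical, and the sketch assumes that frames captured at the same timestep for the same identity share one expression and differ only in camera viewpoint.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Frame(NamedTuple):
    """Hypothetical record for one captured frame (field names are assumptions)."""
    identity: str
    timestep: int
    camera: str
    path: str


def cross_view_pairs(frames):
    """Pair frames that share identity and timestep but come from different cameras.

    Under the stated assumption, each pair shows the same expression
    from two different viewpoints (poses).
    """
    groups = defaultdict(list)
    for frame in frames:
        groups[(frame.identity, frame.timestep)].append(frame)

    pairs = []
    for key in sorted(groups):
        views = sorted(groups[key], key=lambda f: f.camera)
        # Every unordered combination of two distinct views forms one pair.
        pairs.extend(combinations(views, 2))
    return pairs


# Toy capture index with made-up paths.
frames = [
    Frame("id0", 0, "cam_a", "id0/t0/cam_a.png"),
    Frame("id0", 0, "cam_b", "id0/t0/cam_b.png"),
    Frame("id0", 0, "cam_c", "id0/t0/cam_c.png"),
    Frame("id0", 1, "cam_a", "id0/t1/cam_a.png"),  # single view, yields no pair
    Frame("id1", 0, "cam_a", "id1/t0/cam_a.png"),
    Frame("id1", 0, "cam_b", "id1/t0/cam_b.png"),
]

pairs = cross_view_pairs(frames)
for first, second in pairs:
    print(first.identity, first.timestep, first.camera, second.camera)
```

In a training loop, an encoder could be asked to produce matching expression latents for both frames of a pair, which is one plausible way to encourage the view-invariant representation the caption mentions. The paper's actual objective is not given in this excerpt.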