Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang

TL;DR

Portrait4D presents a data-driven solution to one-shot 4D head avatar synthesis by decoupling data generation from real-image reconstruction. It first learns GenHead, a part-wise, shape-conditioned 4D head generator trained on monocular images to produce large-scale synthetic multi-view, full-motion data, then trains a transformer-based animatable triplane reconstructor Psi on this synthetic data to reconstruct 4D heads from real images with disentangled learning to improve generalization. The key contributions are the GenHead architecture with a part-wise deformation field and FLAME-based morphing, the synthetic-data-driven 4D head reconstruction pipeline, and the disentangled training strategy that isolates reconstruction from reenactment. Experiments show state-of-the-art fidelity, 3D consistency, and motion control, enabling fast, photorealistic head avatars with foreground-background separation for applications in video, VR, and telepresence. The approach highlights the potential of synthetic supervision to scale 4D head synthesis while acknowledging current limitations and ethical considerations.

Abstract

Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction, yet the latter is equally challenging, which restricts them from reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning, to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art.

Paper Structure

This paper contains 56 sections, 11 equations, 24 figures, 6 tables.

Figures (24)

  • Figure 1: Overview of our method. We first learn a 4D generative head model from monocular images to synthesize large-scale 4D data. Then, we utilize the synthetic data to learn a one-shot 4D head reconstruction model in a data-driven manner.
  • Figure 2: Architecture of the animatable triplane reconstructor $\Psi$. An encoder $\mathrm{E}_{global}$ first extracts the appearance feature map of $I_s$. The feature is then sent to a canonicalization and reenactment module $\Phi$ consisting of a de-expression module $\Phi_{de}$ and a reenactment module $\Phi_{re}$ sharing the same structure, which receives motion features from either $I_s$ or $I_d$ for expression neutralization or motion injection accordingly. The reenacted feature is then concatenated with a detail feature map from another encoder $\mathrm{E}_{detail}$, and sent to a decoder $\mathrm{G}_{T}$ to synthesize a tri-plane $T$, bearing the appearance of $I_s$ and the motion of $I_d$. With a FLAME-derived 3D deformation field $\mathcal{D}_{neck}$ to handle neck pose and a volumetric renderer with 2D super-resolution, $T$ can be rendered to a reenacted image $I_{re}$ at an arbitrary view.
  • Figure 3: Qualitative comparison on one-shot head reenactment with previous methods. Best viewed with zoom-in.
  • Figure 4: Reenactment results with different driving targets.
  • Figure 5: Reconstruction and driving results of different settings.
  • ...and 19 more figures
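
The data flow described in the Figure 2 caption can be traced with a minimal, shape-only sketch. It is not the authors' implementation: every channel count, resolution, and downsampling factor below is an illustrative assumption, the `Feature` placeholder and all function names are hypothetical, and feeding $I_s$ to $\mathrm{E}_{detail}$ is assumed rather than stated in the caption. The neck deformation field $\mathcal{D}_{neck}$ and the renderer are omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical placeholder that records only a name and a (C, H, W) shape."""
    name: str
    shape: Tuple[int, int, int]


def e_global(image: Feature) -> Feature:
    # E_global: appearance feature map of I_s (assumed 256 channels, 8x downsampling).
    _, h, w = image.shape
    return Feature("appearance", (256, h // 8, w // 8))


def e_detail(image: Feature) -> Feature:
    # E_detail: detail feature map (assumed 128 channels, same resolution).
    _, h, w = image.shape
    return Feature("detail", (128, h // 8, w // 8))


def motion(image: Feature) -> Feature:
    # Motion feature of an image, modeled here as one assumed 64-d vector.
    return Feature(f"motion({image.name})", (64, 1, 1))


def phi_de(appearance: Feature, source_motion: Feature) -> Feature:
    # De-expression module: neutralizes the source expression, shape preserved.
    return Feature("neutralized", appearance.shape)


def phi_re(neutral: Feature, driving_motion: Feature) -> Feature:
    # Reenactment module: injects the driving motion, shape preserved.
    return Feature("reenacted", neutral.shape)


def concat(a: Feature, b: Feature) -> Feature:
    # Channel-wise concatenation; spatial sizes must agree.
    if a.shape[1:] != b.shape[1:]:
        raise ValueError("spatial sizes must match")
    return Feature("fused", (a.shape[0] + b.shape[0],) + a.shape[1:])


def g_t(fused: Feature) -> Feature:
    # Decoder G_T: emits a tri-plane T (assumed 3 planes x 32 channels at 256x256).
    return Feature("triplane", (3 * 32, 256, 256))


def reconstruct(i_s: Feature, i_d: Feature) -> Feature:
    # Appearance comes from I_s; motion is removed using I_s, then injected from I_d.
    appearance = e_global(i_s)
    neutral = phi_de(appearance, motion(i_s))
    reenacted = phi_re(neutral, motion(i_d))
    return g_t(concat(reenacted, e_detail(i_s)))


if __name__ == "__main__":
    source = Feature("I_s", (3, 512, 512))
    driving = Feature("I_d", (3, 512, 512))
    print(reconstruct(source, driving))
```

The sketch only makes the ordering explicit: the source image contributes both the appearance and the detail features, while the driving image contributes nothing but a motion feature, which is why the resulting tri-plane bears the appearance of $I_s$ and the motion of $I_d$.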