Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Yu Deng, Duomin Wang, Baoyuan Wang

TL;DR

Portrait4D-v2 tackles one-shot 4D head avatar synthesis by generating pseudo multi-view data from monocular videos through a learned static 3D head synthesizer and then training a 4D head synthesizer with cross-view self-reenactment. The method combines a tri-plane NeRF representation with a Vision Transformer backbone and motion embeddings to achieve faithful reconstruction, strong geometry consistency, and precise motion control, while reducing reliance on 3DMM priors. A two-stage training pipeline (first $\mathrm{\Psi}_{3d}$ on synthetic multi-view data, then $\mathrm{\Psi}$ on real videos with cross-view supervision) distills 3D priors into the 4D model and enables robust learning from in-the-wild data. Empirical results show clear gains over both 2D-based and 3D-aware baselines in LPIPS, FID, ID similarity, AED, and APD, with lightweight inference and scalable training, highlighting the practical impact of integrating 3D priors with 2D data for realistic 4D head avatars.

Abstract

In this paper, we propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis. Different from existing methods that often learn from reconstructing monocular videos guided by 3DMM, we employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner, avoiding reliance on inaccurate 3DMM reconstruction that could be detrimental to the synthesis performance. The key idea is to first learn a 3D head synthesizer using synthetic multi-view images to convert monocular real videos into multi-view ones, and then utilize the pseudo multi-view videos to learn a 4D head synthesizer via cross-view self-reenactment. By leveraging a simple vision transformer backbone with motion-aware cross-attentions, our method exhibits superior performance compared to previous methods in terms of reconstruction fidelity, geometry consistency, and motion control accuracy. We hope our method offers novel insights into integrating 3D priors with 2D supervisions for improved 4D head avatar creation.
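The first step of the key idea, converting a monocular video into a pseudo multi-view one, can be illustrated with a minimal data-flow sketch. This is not the authors' implementation: `render_novel_view` is a hypothetical stand-in for the 3D head synthesizer, the `View` record, the yaw-only camera, and the sampling range are assumptions made purely for illustration, and no images are actually rendered.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class View:
    frame_index: int  # which video frame this view belongs to
    yaw_deg: float    # hypothetical camera yaw (assumption: yaw-only camera)
    pseudo: bool      # True if produced by the stand-in 3D synthesizer


def render_novel_view(frame_index: int, yaw_deg: float) -> View:
    """Hypothetical stand-in for the 3D head synthesizer.

    A real model would reconstruct a 3D head from the frame and render an
    image at the requested camera; here we only record the bookkeeping.
    """
    return View(frame_index=frame_index, yaw_deg=yaw_deg, pseudo=True)


def to_pseudo_multiview(num_frames: int,
                        views_per_frame: int,
                        max_yaw_deg: float = 30.0,
                        seed: int = 0) -> List[List[View]]:
    """Turn a monocular video (num_frames frames) into per-frame view lists."""
    rng = random.Random(seed)
    video = []
    for t in range(num_frames):
        # Slot 0 keeps the original frame; yaw 0.0 is only a placeholder
        # for the real (unknown) camera of that frame.
        views = [View(frame_index=t, yaw_deg=0.0, pseudo=False)]
        for _ in range(views_per_frame):
            yaw = rng.uniform(-max_yaw_deg, max_yaw_deg)
            views.append(render_novel_view(t, yaw))
        video.append(views)
    return video


multiview = to_pseudo_multiview(num_frames=4, views_per_frame=3)
```

Each entry of `multiview` holds the original frame followed by its synthesized views, which is the structure a 4D synthesizer could then consume for cross-view training.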

Paper Structure

This paper contains 40 sections, 5 equations, 17 figures, and 5 tables.

Figures (17)

  • Figure 1: Our method utilizes a feed-forward 4D head synthesizer to create photorealistic head avatars from a single source image. The facial expressions and neck pose of the 3D heads can be precisely controlled by another driving frame (e.g., see the mouth, eye gaze, and forehead wrinkles). The synthesized results also support free-view rendering, thanks to the underlying accurate head geometries. Best viewed with zoom-in.
  • Figure 2: Overview of our approach. Given a monocular training video, we first leverage a pre-trained 3D synthesizer $\mathrm{\Psi}_{3d}$ to turn each driving frame within the video into a multi-view one, and then use the pseudo multi-view driving frames and a source frame sampled from the original video to perform cross-view self-reenactment for learning a feed-forward 4D head synthesizer $\mathrm{\Psi}$. After training, $\mathrm{\Psi}$ can synthesize an animatable 3D head given two arbitrary images as the source and driving, respectively.
  • Figure 3: (a) Self-reconstruction comparison between $\mathrm{\Psi}_{3d}$ and Portrait4D (deng2023portrait) learned with the same data. $\mathrm{\Psi}_{3d}$ yields better reconstruction fidelity. (b) $\mathrm{\Psi}_{3d}$ learned using static images is capable of maintaining geometry consistency across different frames within a video clip.
  • Figure 4: One-shot head synthesis results of our method on in-the-wild images.
  • Figure 5: Qualitative comparison using in-the-wild sources and VFHQ drivings.
  • ...and 12 more figures
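
The cross-view self-reenactment setup described in the Figure 2 caption can be sketched as a sampling routine over frame and view indices. This is a hedged illustration rather than the paper's training code: the `Sample` tuple and `sample_cross_view_pair` are hypothetical names, and the choice to use two different views of the same driving frame as the driving input and the supervision target is an assumption about how "cross-view" is realized.

```python
import random
from typing import NamedTuple, Tuple


class Sample(NamedTuple):
    source: Tuple[int, int]   # (frame, view) supplying appearance
    driving: Tuple[int, int]  # (frame, view) supplying motion
    target: Tuple[int, int]   # (frame, view) used as supervision


def sample_cross_view_pair(num_frames: int,
                           num_views: int,
                           rng: random.Random) -> Sample:
    """Pick one training tuple from a pseudo multi-view video.

    View index 0 denotes the original video frame; indices >= 1 denote
    views synthesized by the 3D head synthesizer (assumed layout).
    """
    if num_frames < 2 or num_views < 2:
        raise ValueError("need at least 2 frames and 2 views")
    # Source and driving come from different frames of the same video,
    # so identity is shared while expression and pose may differ.
    src_t, drv_t = rng.sample(range(num_frames), 2)
    # Assumption: driving input and target are two distinct views of the
    # same driving frame, so the model must re-render motion across views.
    drv_v, tgt_v = rng.sample(range(num_views), 2)
    return Sample(source=(src_t, 0),
                  driving=(drv_t, drv_v),
                  target=(drv_t, tgt_v))


rng = random.Random(0)
batch = [sample_cross_view_pair(num_frames=5, num_views=4, rng=rng)
         for _ in range(100)]
```

Under these assumptions, every tuple takes its source from the original video while the target shows the driving frame's motion from a different camera, so copying the driving image pixel-for-pixel cannot satisfy the supervision.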