Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Yu Deng; Duomin Wang; Baoyuan Wang

TL;DR

Portrait4D-v2 tackles one-shot 4D head avatar synthesis by generating pseudo multi-view data from monocular videos with a learned static 3D head synthesizer, then training a 4D head synthesizer via cross-view self-reenactment. The method combines a tri-plane NeRF representation with a Vision Transformer backbone and motion embeddings to achieve faithful reconstruction, strong geometry consistency, and precise motion control, while reducing reliance on 3DMM priors. A two-stage training pipeline (first $\mathrm{\Psi}_{3d}$ on synthetic multi-view data, then $\mathrm{\Psi}$ on real videos with cross-view supervision) distills 3D priors into the 4D model and enables robust learning from in-the-wild data. Empirical results show clear gains over both 2D-based and 3D-aware baselines in LPIPS, FID, ID similarity, AED, and APD, with lightweight inference and scalable training, highlighting the practical impact of integrating 3D priors with 2D data for realistic 4D head avatars.

Abstract

In this paper, we propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis. Different from existing methods that often learn from reconstructing monocular videos guided by 3DMM, we employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner, avoiding reliance on inaccurate 3DMM reconstruction that could be detrimental to the synthesis performance. The key idea is to first learn a 3D head synthesizer using synthetic multi-view images to convert monocular real videos into multi-view ones, and then utilize the pseudo multi-view videos to learn a 4D head synthesizer via cross-view self-reenactment. By leveraging a simple vision transformer backbone with motion-aware cross-attentions, our method exhibits superior performance compared to previous methods in terms of reconstruction fidelity, geometry consistency, and motion control accuracy. We hope our method offers novel insights into integrating 3D priors with 2D supervisions for improved 4D head avatar creation.
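The data flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `psi_3d`, `TrainingSample`, `make_cross_view_samples`, and the integer camera index are hypothetical names, frames are represented as plain strings, and pairing each driving frame with one randomly chosen pseudo view is only one plausible reading of cross-view self-reenactment.

```python
import random
from dataclasses import dataclass


@dataclass
class TrainingSample:
    source: str         # frame from the original monocular video (appearance)
    driving: str        # another frame from the same video (motion)
    target_view: str    # pseudo novel view of the driving frame
    target_camera: int  # hypothetical camera index used for the novel view


def psi_3d(frame: str, camera: int) -> str:
    """Stand-in for the pre-trained static 3D head synthesizer (assumption).

    A real model would lift the frame to 3D and render it from `camera`;
    here we only tag the frame name so the data flow can be followed.
    """
    return f"{frame}@cam{camera}"


def make_cross_view_samples(video, num_cameras, rng):
    """Build (source, driving, pseudo-view target) tuples from one video."""
    samples = []
    for driving in video:
        # The source is a different frame of the same video.
        source = rng.choice([f for f in video if f != driving])
        # The supervision target is the driving frame seen from another view.
        camera = rng.randrange(num_cameras)
        samples.append(
            TrainingSample(source, driving, psi_3d(driving, camera), camera)
        )
    return samples


video = ["frame0", "frame1", "frame2"]
samples = make_cross_view_samples(video, num_cameras=4, rng=random.Random(0))
```

In this sketch, a 4D synthesizer would be asked to reproduce `target_view` from `source` and `driving`, so supervision comes from a viewpoint that differs from the driving frame's own.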

Paper Structure (40 sections, 5 equations, 17 figures, 5 tables)

This paper contains 40 sections, 5 equations, 17 figures, 5 tables.

Introduction
Related Work
2D-based talking head generation.
3D-aware head avatar synthesis.
Preliminaries: Triplane-Based 3D Representation
Method
Revisiting Portrait4D Reconstructor
Characteristics of the reconstructor.
3D Synthesizer for Multi-View Video Creation
Learning the 3D synthesizer.
Synthesizing multi-view videos.
Cross-View Self-Reenactment Learning
Effect of cross-view learning.
Experiment
Implementation.
...and 25 more sections

Figures (17)

Figure 1: Our method utilizes a feed-forward 4D head synthesizer to create photorealistic head avatars from a single source image. The facial expressions and neck pose of the 3D heads can be precisely controlled by another driving frame (e.g., see the mouth, eye gaze, and forehead wrinkles). The synthesized results also support free-view rendering, thanks to the underlying accurate head geometries. Best viewed with zoom-in.
Figure 2: Overview of our approach. Given a monocular training video, we first leverage a pre-trained 3D synthesizer $\mathrm{\Psi}_{3d}$ to turn each driving frame within the video into a multi-view one, and then use the pseudo multi-view driving frames and a source frame sampled from the original video to perform cross-view self-reenactment for learning a feed-forward 4D head synthesizer $\mathrm{\Psi}$. After training, $\mathrm{\Psi}$ can synthesize an animatable 3D head given two arbitrary images as the source and driving, respectively.
Figure 3: (a) Self-reconstruction comparison between $\mathrm{\Psi}_{3d}$ and the Portrait4D reconstructor (deng2023portrait) learned with the same data. $\mathrm{\Psi}_{3d}$ yields better reconstruction fidelity. (b) $\mathrm{\Psi}_{3d}$ learned using static images is capable of maintaining geometry consistency across different frames within a video clip.
Figure 4: One-shot head synthesis results of our method on in-the-wild images.
Figure 5: Qualitative comparison using in-the-wild sources and VFHQ drivings.
...and 12 more figures
