Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
Yu Deng, Duomin Wang, Baoyuan Wang
TL;DR
Portrait4D-v2 tackles one-shot 4D head avatar synthesis by generating pseudo multi-view data from monocular videos with a learned static 3D head synthesizer, then training a 4D head synthesizer on that data via cross-view self-reenactment. The method combines a tri-plane NeRF representation with a Vision Transformer backbone and motion embeddings to achieve faithful reconstruction, strong geometry consistency, and precise motion control, while reducing reliance on 3DMM priors. A two-stage training pipeline (first $\boldsymbol{\Psi}_{3d}$ on synthetic multi-view data, then $\boldsymbol{\Psi}$ on real videos with cross-view supervision) distills 3D priors into the 4D model and enables robust learning from in-the-wild data. Empirical results show clear gains over both 2D-based and 3D-aware baselines in LPIPS, FID, ID similarity, AED, and APD, with lightweight inference and scalable training, highlighting the practical impact of integrating 3D priors with 2D data for realistic 4D head avatars.
Abstract
In this paper, we propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis. Unlike existing methods, which often learn by reconstructing monocular videos under 3DMM guidance, we employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner, avoiding reliance on inaccurate 3DMM reconstruction that could harm synthesis performance. The key idea is to first learn a 3D head synthesizer using synthetic multi-view images to convert monocular real videos into multi-view ones, and then utilize the pseudo multi-view videos to learn a 4D head synthesizer via cross-view self-reenactment. By leveraging a simple vision transformer backbone with motion-aware cross-attentions, our method exhibits superior performance compared to previous methods in terms of reconstruction fidelity, geometry consistency, and motion control accuracy. We hope our method offers novel insights into integrating 3D priors with 2D supervision for improved 4D head avatar creation.
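The data flow behind cross-view self-reenactment can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `View`, `make_pseudo_multiview`, and `sample_cross_view_pair` are hypothetical, images are replaced by (frame, camera) index pairs, and the sampling rule (source and target differ in both frame and camera, with the driving motion taken from the original camera) is an assumption made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    frame: int   # index of the video frame (carries the head motion)
    camera: int  # index of the camera pose (0 = original camera, an assumption)


def make_pseudo_multiview(num_frames, num_cameras):
    """Stand-in for the static 3D head synthesizer (hypothetical helper).

    Every monocular frame is 'rendered' from every camera, producing a
    frames x cameras grid of pseudo multi-view samples.
    """
    return [[View(f, c) for c in range(num_cameras)] for f in range(num_frames)]


def sample_cross_view_pair(grid, rng):
    """Sample one training triplet for cross-view self-reenactment.

    The source supplies appearance, the driving frame supplies motion, and
    the target is the same moment as the driving frame seen from another camera.
    """
    num_frames, num_cameras = len(grid), len(grid[0])
    src_f, tgt_f = rng.sample(range(num_frames), 2)   # two distinct frames
    src_c, tgt_c = rng.sample(range(num_cameras), 2)  # two distinct cameras
    return {
        "source": grid[src_f][src_c],   # appearance reference
        "driving": grid[tgt_f][0],      # motion reference (original camera)
        "target": grid[tgt_f][tgt_c],   # supervision image for the 4D synthesizer
    }


grid = make_pseudo_multiview(num_frames=4, num_cameras=3)
pair = sample_cross_view_pair(grid, random.Random(0))
print(pair)
```

In a real training loop the 4D synthesizer would receive the source image, the motion extracted from the driving frame, and the target camera, and its rendering would be compared against the target image; the sketch only shows how such triplets could be drawn from the pseudo multi-view grid.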
