Coherent 3D Portrait Video Reconstruction via Triplane Fusion
Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Josef Spjut, Henry Fuchs, Shalini De Mello, Koki Nagano
TL;DR
This work tackles the challenge of producing temporally coherent, photorealistic 3D portrait videos from monocular RGB input by fusing a personalized triplane prior with per-frame observations. The authors introduce a triplane fusion framework comprising a Triplane Undistorter and a Triplane Fuser, which operate on a reference frontal triplane and on per-frame triplanes lifted by a pretrained LP3D. The framework is trained on synthetic data from the expression-conditioned Next3D GAN. Key contributions include (i) a fusion scheme that preserves dynamic per-frame appearance (lighting, expressions, shoulder pose) while maintaining identity consistency through the prior, (ii) a visibility-driven fusion strategy with occlusion weighting and per-plane networks to avoid collapse and preserve 3D structure, and (iii) an evaluation framework with multi-view metrics on the challenging NeRSemble dataset, demonstrating state-of-the-art results in both reconstruction accuracy and temporal stability. The method enables telepresence applications by delivering faithful, temporally stable 3D portraits from consumer-grade cameras. Its limitations include sensitivity to extreme side views, and real-time speedups are left for future work.
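To make the idea of visibility-driven fusion concrete, the sketch below blends a per-frame triplane with a prior triplane using a per-location visibility weight. It is only an illustration under stated assumptions: the paper's Triplane Fuser is a learned network, whereas this is a hand-written linear blend, and the function name `fuse_triplanes`, the `(3, C, H, W)` layout, and the `visibility` map are hypothetical choices made for this example.

```python
import numpy as np


def fuse_triplanes(frame_tp, prior_tp, visibility):
    """Blend a per-frame triplane with a prior triplane (illustrative only).

    Assumed layout (not taken from the paper):
      frame_tp, prior_tp: arrays of shape (3, C, H, W), one feature map per plane
      visibility:         array of shape (3, 1, H, W) with values in [0, 1];
                          1 = region observed in the current frame,
                          0 = region occluded, so fall back to the prior
    """
    frame_tp = np.asarray(frame_tp, dtype=np.float32)
    prior_tp = np.asarray(prior_tp, dtype=np.float32)
    if frame_tp.shape != prior_tp.shape:
        raise ValueError("frame and prior triplanes must have the same shape")

    # Clamp the weights so the result stays a convex combination.
    w = np.clip(np.asarray(visibility, dtype=np.float32), 0.0, 1.0)

    # Broadcasting applies the single-channel weight to every feature channel.
    return w * frame_tp + (1.0 - w) * prior_tp


# Toy usage: constant triplanes make the blend easy to inspect.
frame = np.full((3, 4, 8, 8), 2.0, dtype=np.float32)
prior = np.full((3, 4, 8, 8), 6.0, dtype=np.float32)
vis = np.zeros((3, 1, 8, 8), dtype=np.float32)
vis[:, :, :, :4] = 1.0  # left half visible, right half occluded
fused = fuse_triplanes(frame, prior, vis)
```

In this toy setup the visible half of each plane keeps the per-frame features and the occluded half falls back to the prior, which is the behavior the occlusion weighting is meant to encourage. A learned fuser can go beyond such a fixed linear rule.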
Abstract
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearance. Trained only on synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.
