Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion
Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Henry Fuchs, Shalini De Mello, Koki Nagano
TL;DR
Coherent3D addresses the challenge of producing temporally coherent 3D portrait videos from a single camera by fusing a canonical triplane derived from a reference image with per-frame triplanes. The pipeline uses an LP3D encoder to obtain a raw triplane from each frame, an Undistorter to reduce view-dependent distortions using the canonical triplane as a prior, and a Triplane Fuser with visibility estimation to reconstruct occluded regions. All components are trained exclusively on synthetic data generated with Next3D. The method achieves state-of-the-art temporal consistency and faithful reconstruction of dynamic appearance on both in-studio and in-the-wild datasets, outperforming per-frame reconstruction and self-reenactment baselines. This work advances democratized telepresence by enabling robust, photorealistic 3D portrait video synthesis from monocular input, without test-time optimization.
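The role of visibility in the fusion step can be illustrated with a minimal sketch. This is not the authors' Triplane Fuser, which is a learned network. It assumes, purely for illustration, that fusion reduces to a per-element linear blend: where the per-frame view sees a region (visibility near 1) the per-frame features are kept, and where the region is occluded (visibility near 0) the canonical prior fills in. The names `fuse_plane` and `fuse_triplane`, and the single-channel list-of-lists plane layout, are hypothetical.

```python
from typing import List

# Hypothetical layout: one feature plane is an H x W grid of floats
# (a single channel, for brevity); a triplane is a list of three planes.
Plane = List[List[float]]


def fuse_plane(canonical: Plane, per_frame: Plane, visibility: Plane) -> Plane:
    """Blend one plane: per-frame features where visible, canonical where occluded.

    Assumption: a linear blend stands in for the learned fuser.
    """
    fused = []
    for c_row, f_row, v_row in zip(canonical, per_frame, visibility):
        fused.append([v * f + (1.0 - v) * c for c, f, v in zip(c_row, f_row, v_row)])
    return fused


def fuse_triplane(
    canonical: List[Plane], per_frame: List[Plane], visibility: List[Plane]
) -> List[Plane]:
    """Apply the same blend independently to each of the three planes."""
    return [fuse_plane(c, f, v) for c, f, v in zip(canonical, per_frame, visibility)]


# Toy example: the first cell is fully visible, the second fully occluded,
# the third half visible.
canonical = [[[1.0, 1.0, 1.0]]] * 3
per_frame = [[[0.0, 0.0, 0.0]]] * 3
visibility = [[[1.0, 0.0, 0.5]]] * 3
fused = fuse_triplane(canonical, per_frame, visibility)
```

In this toy case the visible cell takes the per-frame value, the occluded cell takes the canonical value, and the half-visible cell lands midway between them.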
Abstract
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but fail to faithfully preserve the user's per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, neither of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearance. Trained using only synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets. https://research.nvidia.com/labs/amri/projects/coherent3d
