Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Henry Fuchs, Shalini De Mello, Koki Nagano

TL;DR

Coherent3D addresses the challenge of producing temporally coherent 3D portrait videos from a single camera by fusing a canonical triplane derived from a reference image with per-frame triplanes. The pipeline uses an LP3D encoder to obtain a raw triplane, an Undistorter to reduce view-dependent distortions against a canonical prior, and a Triplane Fuser with visibility estimation to reconstruct occluded regions, all trained exclusively on synthetic data from Next3D. It achieves state-of-the-art temporal consistency and faithful dynamic rendering on both in-studio and in-the-wild datasets, outperforming per-frame 3D lifting and self-reenactment baselines. This work advances democratized telepresence by enabling robust, photorealistic 3D portrait video synthesis from monocular input, without test-time optimization.
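
To make the fusion idea concrete, here is a minimal sketch of how a visibility map could decide, per triplane location, whether features come from the per-frame (raw) triplane or from the canonical one. This is an illustration only: the paper's Triplane Fuser is a learned module, whereas the linear blend below is a simple stand-in. The function name `fuse_triplanes`, the `(3, C, H, W)` layout, and the toy inputs are all assumptions made for this example and do not come from the paper.

```python
import numpy as np


def fuse_triplanes(canonical, raw, visibility):
    """Blend a per-frame triplane with a canonical one using a visibility map.

    Hypothetical stand-in for a learned fuser: a plain linear blend.
    canonical, raw: arrays of shape (3, C, H, W), one feature map per plane.
    visibility:     array of shape (3, H, W); values are clipped to [0, 1].
    """
    canonical = np.asarray(canonical, dtype=np.float32)
    raw = np.asarray(raw, dtype=np.float32)
    visibility = np.asarray(visibility, dtype=np.float32)

    if canonical.ndim != 4 or canonical.shape != raw.shape:
        raise ValueError("canonical and raw must share shape (3, C, H, W)")
    expected = (canonical.shape[0],) + canonical.shape[2:]
    if visibility.shape != expected:
        raise ValueError(f"visibility must have shape {expected}")

    # Add a channel axis so the map broadcasts over all C feature channels.
    v = np.clip(visibility, 0.0, 1.0)[:, None, :, :]

    # Visible regions follow the current frame; occluded ones fall back
    # to the canonical prior built from the reference image.
    return v * raw + (1.0 - v) * canonical


# Toy inputs (assumed values): the left half of each plane is "visible".
canonical = np.zeros((3, 4, 8, 8), dtype=np.float32)
raw = np.ones((3, 4, 8, 8), dtype=np.float32)
visibility = np.zeros((3, 8, 8), dtype=np.float32)
visibility[:, :, :4] = 1.0

fused = fuse_triplanes(canonical, raw, visibility)
```

In this toy setup, the visible half of `fused` takes the raw features and the occluded half keeps the canonical ones. A learned fuser can go beyond such a blend, for example by reconciling the two sources where they disagree.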

Abstract

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real time, democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a 3D avatar built from a single reference image, but fail to faithfully preserve the user's per-frame appearance (e.g., instantaneous facial expression and lighting). As a result, neither of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that takes the best of both worlds by fusing a canonical 3D prior from a reference view with dynamic appearance from per-frame input views, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearance. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction and temporal consistency on in-studio and in-the-wild datasets. https://research.nvidia.com/labs/amri/projects/coherent3d

Paper Structure

This paper contains 15 sections, 12 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Given a single reference image and a single-view video frame, our method reconstructs the authentic dynamic appearance of the user (e.g., facial expressions and lighting) while producing a temporally coherent 3D video. A previous single-view 3D lifting method (LP3D) that reconstructs the avatar from the video frame on a per-frame basis suffers from distortions and temporal inconsistencies. A portrait reenactment method (GPAvatar) drives the identity in the reference image using the video frame, but fails to capture accurate facial expressions (e.g., smile) and per-frame appearance (e.g., lighting). The output should be compared to the appearance of the per-frame video (green box).
  • Figure 2: View-Dependent Distortion. Top: inputs to our model and LP3D. Second and third rows: LP3D's reconstructions vary greatly under challenging viewpoints, showing a predictable pattern of artifacts, including abnormally strong activations on the side being captured (red circle) and geometric distortion along the view direction of the camera. We refer to this phenomenon as "View-Dependent Distortion". Fourth row: our method removes such artifacts and achieves better coherence.
  • Figure 3: Overview. Given a (near) frontal reference image and an input frame, we reconstruct a canonical triplane and a raw triplane, respectively, using LP3D \cite{trevithick2023} (Sec. \ref{sec:lp3d}). Next, we combine these two triplanes through a Triplane Fusion module (blue box) that ensures temporal consistency while preserving real-time dynamics (e.g., lighting and shoulder pose) (Sec. \ref{sec:undistorter} and Sec. \ref{sec:fuser}). Our model is trained with only synthetic video data generated by a 3D GAN \cite{sun2023next3d}, with carefully designed augmentations to preserve shoulder motion and lighting (Sec. \ref{sec:data}).
  • Figure 4: Visual comparisons with baseline methods. Our method strikes a balance between coherent reconstruction and faithful dynamic conditions like expressions. LP3D (third column) exhibits inconsistencies in identities and hairstyles, as well as artifacts (red circles). GPAvatar (fourth column) fails to capture challenging expressions (first row), new information not present in the reference image (the stuck-out tongue in the second and third rows), and the identity of the person (last row).