Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Shengze Wang, Xueting Li, Chao Liu, Matthew Chan, Michael Stengel, Josef Spjut, Henry Fuchs, Shalini De Mello, Koki Nagano

TL;DR

This work tackles the challenge of producing temporally coherent, photorealistic 3D portrait videos from monocular RGB input by fusing a personalized triplane prior with per-frame observations. The authors introduce a triplane fusion framework comprising a Triplane Undistorter and a Triplane Fuser that leverage a reference frontal triplane and per-frame triplanes lifted by a pretrained LP3D, all trained on synthetic data from the expression-conditioned Next3D GAN. Key contributions include (i) a novel fusion scheme that preserves dynamic per-frame appearance (lighting, expressions, shoulder pose) while maintaining identity consistency through the priors, (ii) a visibility-driven fusion strategy with occlusion weighting and per-plane networks to avoid collapse and preserve 3D structure, and (iii) a robust evaluation framework with multi-view metrics and a challenging NeRSemble dataset demonstrating state-of-the-art results in both reconstruction accuracy and temporal stability. The method enables telepresence applications by delivering faithful, temporally stable 3D portraits from consumer-grade cameras, with limitations including sensitivity to extreme side views and real-time speedups left for future work.
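
The visibility-driven fusion idea mentioned above can be illustrated with a toy sketch. This is not the authors' Triplane Fuser, which is a learned network; it is a hand-written convex blend that assumes a per-location visibility weight is already available, taking per-frame features where the input view sees the surface and falling back to the prior elsewhere. The names `fuse_plane`, `fuse_triplane`, and the `visibility` argument are hypothetical, and each plane is reduced to a single-channel grid of floats for readability.

```python
from typing import List

Plane = List[List[float]]  # one feature channel of one plane, H x W (toy stand-in)


def fuse_plane(prior: Plane, raw: Plane, visibility: Plane) -> Plane:
    """Blend prior and per-frame (raw) features with a visibility weight.

    Assumption: visibility is 1 where the input frame observes the region
    and 0 where it is occluded. Values outside [0, 1] are clamped.
    """
    fused = []
    for p_row, r_row, v_row in zip(prior, raw, visibility):
        row = []
        for p, r, v in zip(p_row, r_row, v_row):
            v = min(1.0, max(0.0, v))
            # Visible regions follow the current frame; occluded ones keep the prior.
            row.append(v * r + (1.0 - v) * p)
        fused.append(row)
    return fused


def fuse_triplane(prior: List[Plane], raw: List[Plane],
                  visibility: List[Plane]) -> List[Plane]:
    """Fuse the three planes independently, one blend per plane."""
    return [fuse_plane(p, r, v) for p, r, v in zip(prior, raw, visibility)]


if __name__ == "__main__":
    prior = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
    raw = [[[3.0, 3.0], [3.0, 3.0]] for _ in range(3)]
    vis = [[[1.0, 0.0], [0.5, 2.0]] for _ in range(3)]
    print(fuse_triplane(prior, raw, vis)[0])
```

In the described method the blend is predicted by networks rather than fixed by a formula, so this sketch only conveys why an occlusion-aware weight lets the output keep a stable identity in unseen regions while still tracking per-frame lighting and expression in visible ones.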

Abstract

Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.

Paper Structure (24 sections, 19 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: We propose a triplane fusion method for reconstructing coherent 3D portrait videos. Our method captures the authentic dynamic appearance of the user (e.g., facial expressions and lighting) while producing temporally coherent 3D videos. Trained only using a synthetic 3D video dataset, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency.
  • Figure 1: In-the-wild Lighting (GPAvatar vs. Ours): Our method captures the dynamic lighting changes in the input video, whereas GPAvatar fails to do so. Note that the output of the models should match the lighting and expression of the input video frame (green box).
  • Figure 2: Overview. Given a (near) frontal reference image and an input frame, we reconstruct a triplane prior and a raw triplane, respectively, using an improved LP3D \cite{trevithick2023} (Sec. \ref{sec:lp3d}). Next, we combine these two triplanes through a Triplane Fusion module (blue box) that ensures temporal consistency while capturing real-time dynamic conditions like lighting and shoulder pose (Sec. \ref{sec:undistorter} and Sec. \ref{sec:fuser}). Our model is trained with only synthetic video data generated by a 3D GAN \cite{sun2023next3d}, with carefully designed augmentation methods to account for shoulder motion and lighting changes (Sec. \ref{sec:data}).
  • Figure 2: In-the-wild Expression (GPAvatar vs. Ours): Our method captures human expressions in the input video more accurately, whereas GPAvatar fails to reconstruct authentic expressions. Note that the output of the models should match the lighting and expression of the input video frame.
  • Figure 3: View-Dependent Distortion. Top: inputs to our model and LP3D. Second & Third Rows: LP3D's reconstructions vary greatly under challenging viewpoints, showing a predictable pattern of artifacts, including abnormally strong activations on the side being captured (red circle) as well as geometric distortion along the view direction of the camera. We refer to this phenomenon as "View-Dependent Distortion". Fourth & Fifth Rows: Our method removes such artifacts and achieves better coherence.
  • ...and 11 more figures