Real-time 3D-aware Portrait Video Relighting

Ziqi Cai, Kaiwen Jiang, Shu-Yu Chen, Yu-Kun Lai, Hongbo Fu, Boxin Shi, Lin Gao

TL;DR

This paper presents the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF), and achieves state-of-the-art results in terms of reconstruction quality, lighting error, lighting instability, temporal consistency and inference speed.

Abstract

Synthesizing realistic videos of talking faces under custom lighting conditions and viewing angles benefits various downstream applications like video conferencing. However, most existing relighting methods are either time-consuming or unable to adjust the viewpoints. In this paper, we present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF). Given an input portrait video, our method can synthesize talking faces under both novel views and novel lighting conditions with a photo-realistic and disentangled 3D representation. Specifically, we infer an albedo tri-plane, as well as a shading tri-plane based on a desired lighting condition for each video frame with fast dual-encoders. We also leverage a temporal consistency network to ensure smooth transitions and reduce flickering artifacts. Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality, lighting error, lighting instability, temporal consistency and inference speed. We demonstrate the effectiveness and interactivity of our method on various portrait videos with diverse lighting and viewing conditions.

Paper Structure

This paper contains 36 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Given a portrait video shown in the leftmost column, our method reconstructs a 3D relightable face for each video frame. Users can then adjust their viewpoints and lighting conditions interactively. The second column displays relighted video frames with a head pose yaw of 0.3, while the third column presents faces relighted under an alternative lighting condition with a frontal head pose. The rightmost column provides the predicted albedo and geometry of the reconstructed face. Please see the supplementary video for the full results.
  • Figure 2: The pipeline of our method. Given a portrait video shown on the left side, we embed each video frame into an albedo tri-plane and a shading tri-plane using Dual-Encoders. For example, for frame $F_i$, we predict the albedo tri-plane $T_A^i$. Next, we use the estimated lighting condition $L$ and the albedo tri-plane $T_A^i$ to predict the shading tri-plane $T_S^i$ that models the illumination effects on the face. Then we feed $T_S^i$ and $T_A^i$, along with the tri-planes predicted from the previous $n$ frames, into two transformer models $C_A$ and $C_S$ to enhance the temporal consistency. The two transformers use cross-attention to share and align information between the albedo and shading branches. We add the predicted residuals to $T_A^i$ and $T_S^i$ to obtain $\hat{T}_A^i$ and $\hat{T}_S^i$, which are more temporally consistent. Finally, we use $\hat{T}_A^i$ and $\hat{T}_S^i$ to condition the volumetric rendering process, producing depth, albedo, shading, color, and super-resolved images.
  • Figure 3: Comparison of video relighting quality on novel views. Our method produces more realistic and consistent results than the baseline methods introduced in the Quantitative Evaluation section.
  • Figure 4: Comparison of video relighting quality in the input view. We compare our method with three methods: SMFR [SMFR], DPR [DPR], and ReliTalk [qiu2023relitalk]. We show the input video frames in the first row and the relighted results under different lighting conditions in the remaining rows. Our method produces more realistic and consistent results than other methods, especially under challenging conditions like the side lighting.
  • Figure 5: Comparison of relighting quality on the input view. We compare our method with six methods: Lumos [yeh2022learning], TR [pandey2021total], NVPR [Zhang_2021_ICCV], SIPR-W [wang2020single], DPR [DPR], and SMFR [SMFR]. We show the input image in the first column, the sphere renderings from the environment map in the second column, and the relighted results in the remaining columns. Our method produces more realistic and consistent results than the other methods.
  • ...and 1 more figure
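
The residual update described in the Figure 2 caption can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the learned transformers $C_A$ and $C_S$ are replaced here by a simple hand-written rule that pulls each tri-plane toward the mean of the previous $n$ tri-planes, and the cross-attention between the albedo and shading branches is not modeled at all. The names `temporal_residual`, `refine_triplanes`, the `weight` parameter, and the choice to use the unrefined previous tri-planes as context are all assumptions made for illustration.

```python
import numpy as np


def temporal_residual(current, history, weight=0.5):
    """Hypothetical stand-in for the transformers C_A / C_S.

    Returns a residual that moves `current` part of the way toward the
    mean of the previous tri-planes in `history`. The real method
    predicts this residual with a learned network.
    """
    if len(history) == 0:
        # First frame: no temporal context, so no correction.
        return np.zeros_like(current)
    return weight * (np.mean(history, axis=0) - current)


def refine_triplanes(albedo_planes, shading_planes, n=2, weight=0.5):
    """Apply T_hat^i = T^i + residual to both branches for every frame.

    albedo_planes, shading_planes: lists of arrays, one per frame
    (any shape; a tri-plane could be stored as (3, H, W, C)).
    n: number of previous frames used as context (assumed value).
    """
    refined = []
    for i, (t_a, t_s) in enumerate(zip(albedo_planes, shading_planes)):
        start = max(0, i - n)
        hat_a = t_a + temporal_residual(t_a, albedo_planes[start:i], weight)
        hat_s = t_s + temporal_residual(t_s, shading_planes[start:i], weight)
        refined.append((hat_a, hat_s))
    return refined
```

Feeding the sketch a sequence of tri-planes whose values alternate from frame to frame yields refined tri-planes with smaller frame-to-frame differences, which is the flicker-reduction effect the caption attributes to the temporal consistency network. The subsequent volumetric rendering step is omitted because the source does not specify how the two tri-planes are combined.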