FED-NeRF: Achieve High 3D Consistency and Temporal Coherence for Face Video Editing on Dynamic NeRF

Hao Zhang; Yu-Wing Tai; Chi-Keung Tang

TL;DR

FED-NeRF introduces a 4D face video editing framework built on a dynamic GAN-NeRF (OmniAvatar). It reconstructs the latent code and frame-wise FLAME geometry from a video, stabilizes the transitions between frames, and supports semantic editing. The method combines a Latent Code Estimator, a Face Geometry Estimator, a Catmull–Rom-based Stabilizer, and a StyleCLIP-inspired Semantic Editor to achieve 3D view consistency and temporal coherence simultaneously in edited videos. Evaluation on in-the-wild sequences shows state-of-the-art performance in 3D consistency (measured by COLMAP reprojection) and temporal stability (measured by RAFT-based metrics), and ablations confirm the benefit of multi-frame latent aggregation and geometry stabilization. The work demonstrates that editing in 4D space with a dynamic NeRF yields natural, identity-preserving face videos with controllable semantic changes.

Abstract

The success of the GAN-NeRF structure has enabled face editing on NeRF to maintain 3D view consistency. However, simultaneously achieving multi-view consistency and temporal coherence while editing video sequences remains a formidable challenge. This paper proposes a novel face video editing architecture built upon the dynamic face GAN-NeRF structure, which effectively utilizes video sequences to restore the latent code and 3D face geometry. Editing the latent code ensures multi-view consistent edits of the face, as validated by multi-view stereo reconstruction on the resulting edited images in our dynamic NeRF. Because face geometry is estimated frame by frame, the output may jitter. We propose a stabilizer that maintains temporal coherence by preserving smooth changes of facial expressions in consecutive frames. Quantitative and qualitative analyses reveal that our method, as the pioneering 4D face video editor, achieves state-of-the-art performance in comparison to existing 2D or 3D-based approaches that address identity and motion independently. Code will be released.
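
To make the stabilizer idea concrete, here is a minimal sketch of Catmull–Rom smoothing applied to a single per-frame control value (for example, one FLAME expression coefficient). It is an illustration under stated assumptions, not the paper's implementation: the function names `catmull_rom` and `smooth_track`, the `stride` keyframe spacing, the uniform parameterization, and the endpoint handling (duplicating the first and last keyframes) are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def catmull_rom(p0: float, p1: float, p2: float, p3: float, t: float) -> float:
    """Uniform Catmull-Rom interpolation between p1 and p2 for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return 0.5 * (
        2.0 * p1
        + (-p0 + p2) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t2
        + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t3
    )


def smooth_track(values: Sequence[float], stride: int = 4) -> List[float]:
    """Resample a jittery per-frame track through sparse keyframes.

    Assumption for this sketch: every `stride`-th frame (plus the last
    frame) is kept as a keyframe, and all other frames are replaced by the
    spline value. The curve passes exactly through the keyframes.
    """
    n = len(values)
    if n < 2 or stride < 1:
        return [float(v) for v in values]
    keys = list(range(0, n - 1, stride)) + [n - 1]
    out: List[float] = []
    for seg in range(len(keys) - 1):
        a, b = keys[seg], keys[seg + 1]
        # Duplicate the boundary keyframes where a neighbour is missing.
        p0 = values[keys[max(seg - 1, 0)]]
        p1, p2 = values[a], values[b]
        p3 = values[keys[min(seg + 2, len(keys) - 1)]]
        for i in range(a, b):
            out.append(catmull_rom(p0, p1, p2, p3, (i - a) / (b - a)))
    out.append(float(values[-1]))
    return out


if __name__ == "__main__":
    jittery = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
    print(smooth_track(jittery, stride=4))
```

In this sketch a larger `stride` discards more per-frame detail and gives a smoother track, while `stride=1` reproduces the input unchanged. A real stabilizer would apply the same interpolation to every dimension of the per-frame geometry parameters.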

Paper Structure (23 sections, 11 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
Training data
Latent Code Estimator
Face Geometry Estimator
Stabilizer
Semantic Editor
Experiments
Editing in-the-wild video sequences
Comparison
3D View Consistency
Temporal Coherence
Reconstruction
...and 8 more sections

Figures (10)

Figure 1: Face video editing results. Editing prompts are "Wear a pair of glasses" and "Curly hair". Every frame within the output sequence is rendered via the dynamic NeRF, which is precisely controlled by the estimated 3D facial geometry. The other 3D views showcase the consistency of the dynamic NeRF.
Figure 2: Overview of our model. Given a video sequence, our model estimates a latent code $w^+$ and FLAME controls. The editor then modifies $w^+$ into $\bar{w^+}$ in accordance with a given text prompt. The Stabilizer ensures the temporal consistency of the FLAME controls. Finally, the edited video sequence is produced under the guidance of the stabilized FLAME controls and $\bar{w^+}$.
Figure 3: The structure of the Latent Code Estimator. Given a video sequence, the image encoder extracts features for each individual frame, which are then aggregated via the cross-attention layer to produce a single latent code, denoted $w_f^+$. The losses $\mathcal{L}_R$ and $\mathcal{L}_{\mathcal{ID}}$ are computed across multiple pairs of images rendered with the estimated $w_f^+$ and the ground-truth $w^+$, respectively.
Figure 4: The structure of the Face Geometry Estimator. An estimate of the FLAME control, $p^{\prime}$, is obtained from an input image. Pairs of images are then rendered with randomly sampled camera poses, and the losses are computed on these pairs.
Figure 5: More in-the-wild editing results. These examples show that our model can achieve 3D consistency even when performing certain edits that alter the facial geometry, such as "Wear a pair of glasses", "Short curly hair", and so on.
...and 5 more figures
