In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing

Yiran Xu, Zhixin Shu, Cameron Smith, Seoung Wug Oh, Jia-Bin Huang

TL;DR

This work tackles the reconstruction-editability trade-off in 3D-aware GAN inversion when input data contain out-of-distribution (OOD) content. It introduces a dual-radiance-field framework that separates the in-distribution (InD) face and OOD content into two tri-planes, composing them via a learned blending weight during low-resolution rendering and subsequently finetuning a super-resolution module. By confining edits to the InD component, the method preserves identity and enables semantic editing and novel-view synthesis even with occlusions or accessories. Experiments on challenging in-the-wild face videos show improved reconstruction fidelity and editable 3D face rendering, along with practical capabilities such as OOD object removal. The approach advances 3D-aware GAN inversion by explicitly modeling OOD objects and retains compatibility with standard GAN-editing tools, albeit with limitations in extreme poses and temporal consistency for video.

Abstract

3D-aware GANs offer new capabilities for view synthesis while preserving the editing functionalities of their 2D counterparts. GAN inversion is a crucial step that seeks the latent code to reconstruct input images or videos, subsequently enabling diverse editing tasks through manipulation of this latent code. However, a model pre-trained on a particular dataset (e.g., FFHQ) often has difficulty reconstructing images with out-of-distribution (OOD) objects such as faces with heavy make-up or occluding objects. We address this issue by explicitly modeling OOD objects from the input in 3D-aware GANs. Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. The final reconstruction is achieved by optimizing the composition of these two radiance fields with carefully designed regularization. We demonstrate that our explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate reconstruction accuracy and editability of our method on challenging real face images and videos and showcase favorable results against other baselines.

Paper Structure (19 sections, 9 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Semantic editing for out-of-distribution data. We present a method for reconstructing and editing an out-of-distribution (OOD) image or video using a pre-trained 3D-aware generative model (EG3D chan2022efficient). Our method explicitly models and reconstructs the occluders in 3D, allowing faithful reconstruction of the input while preserving the semantic editing capability. Here we showcase the reconstruction and editing results "Less smile", "Younger", "Blond" shen2020interpreting, "Elsa", "Surprised" patashnik2021styleclip. Our method can also remove the OOD part. Data are from the Internet (Creative Commons).
  • Figure 2: Limitations of the previous methods. Existing GAN inversion techniques cannot deal with frames with OOD elements, resulting in a poor reconstruction-editing balance. GOAE Yuan_2023_GOAE can produce faithful editing, but fails to preserve the identity of the input face. PTI roich2022pivotal provides higher reconstruction fidelity, but the editability suffers.
  • Figure 3: Overview of our method. Given a portrait image or a monocular portrait video, we use two radiance fields to represent (a) the in-distribution (InD) face and (b) the out-of-distribution (OOD) item. (a) InD reconstruction is the GAN inversion for the in-distribution natural face. We apply GAN inversion to the frame using the pre-trained EG3D model $G$, where the pre-trained tri-plane generator and tri-plane decoder $D^I$ are kept frozen. (b) We model the OOD item with a separate radiance field represented by an additional tri-plane $\mathbf{T}^O$. During training, we optimize the tri-plane $\mathbf{T}^O$, a per-frame latent code $\mathbf{\phi}_t$, and a new decoder $D^O$. The decoder takes the tri-plane features $\mathbf{T}^O$ and $\mathbf{\phi}_t$ as input and outputs color $\mathbf{c}^O$, density $\sigma^O$, and blending weight $b$. (c) Composite rendering combines the InD and OOD radiance fields using a composite rendering scheme (Section \ref{sec:composite_render}). (d) Finally, we finetune the super-resolution module in $G$ to improve the high-resolution output. After training, we can perform various semantic edits and free-view rendering while preserving the face identity and the OOD components.
  • Figure 4: The effect of finetuning SR module. Without finetuning the SR module, the high-resolution output is blurry.
  • Figure 5: Qualitative comparison of the video reconstruction. We compare our approach with $\mathcal{W+}$ and $\mathcal{W}$ optimization, IDE-3D sun2022ide, GOAE Yuan_2023_GOAE, HFGI3D xie2023hfgi3d, VIVE3D fruhstuck2023vive3d, and PTI roich2022pivotal. Our method achieves better reconstruction accuracy on the OOD videos.
  • ...and 4 more figures
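
The composite rendering step described in the Figure 3 caption can be pictured with a small sketch. The captions do not give the exact compositing formula, so the code below assumes a simple per-sample linear blend of the two fields, sigma = (1 - b) * sigma^I + b * sigma^O and c = (1 - b) * c^I + b * c^O, followed by standard volume-rendering quadrature along one ray. The function name `composite_ray` and all argument names are hypothetical and are not taken from the paper or from the EG3D code.

```python
import math


def composite_ray(sigma_ind, color_ind, sigma_ood, color_ood, blend, deltas):
    """Render one ray through two blended radiance fields.

    Assumption (not confirmed by the source): each sample is a linear blend
        sigma = (1 - b) * sigma_I + b * sigma_O
        c     = (1 - b) * c_I     + b * c_O
    and the blended samples are accumulated with the usual
    volume-rendering quadrature.
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]      # accumulated RGB for this ray
    samples = zip(sigma_ind, color_ind, sigma_ood, color_ood, blend, deltas)
    for s_i, c_i, s_o, c_o, b, delta in samples:
        # Blend density and color of the InD and OOD fields at this sample.
        sigma = (1.0 - b) * s_i + b * s_o
        color = [(1.0 - b) * ci + b * co for ci, co in zip(c_i, c_o)]
        # Opacity contributed by this ray segment of length delta.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    # Return the pixel color and the accumulated opacity along the ray.
    return pixel, 1.0 - transmittance
```

Under this assumed blend, setting every blending weight to 0 renders only the InD field, which is one way to picture the OOD removal mentioned in the TL;DR, while setting every weight to 1 renders only the OOD field.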