In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing
Yiran Xu, Zhixin Shu, Cameron Smith, Seoung Wug Oh, Jia-Bin Huang
TL;DR
This work tackles the reconstruction-editability trade-off in 3D-aware GAN inversion when the input contains out-of-distribution (OOD) content. It introduces a dual-radiance-field framework that separates the in-distribution (InD) face and the OOD content into two tri-planes, composites them with a learned blending weight during low-resolution rendering, and then fine-tunes a super-resolution module. Because edits are confined to the InD component, the method preserves identity while supporting semantic editing and novel-view synthesis, even in the presence of occlusions or accessories. Experiments on challenging in-the-wild face videos show improved reconstruction fidelity and editable 3D face rendering, along with practical capabilities such as OOD object removal. The approach advances 3D-aware GAN inversion by explicitly modeling OOD objects and remains compatible with standard GAN-editing tools, though it is limited under extreme poses and in temporal consistency for video.
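To make the composition step concrete, here is a minimal sketch of blending two radiance fields along a single camera ray before standard volume rendering. The summary above does not give the exact blending formula, so the linear per-sample mix used here is an assumption, and the name `composite_ray` and all of its parameters are hypothetical.

```python
import math

def composite_ray(sigma_in, color_in, sigma_ood, color_ood, blend, deltas):
    """Blend two radiance fields per sample, then volume-render one ray.

    ASSUMPTION: a linear mix with a per-sample weight in [0, 1];
    the paper's actual composition rule may differ.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    samples = zip(sigma_in, color_in, sigma_ood, color_ood, blend, deltas)
    for s_in, c_in, s_ood, c_ood, w, delta in samples:
        # w = 0 keeps only the in-distribution field, w = 1 only the OOD field.
        sigma = (1.0 - w) * s_in + w * s_ood
        color = [(1.0 - w) * a + w * b for a, b in zip(c_in, c_ood)]
        # Standard volume rendering quadrature for one sample.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

Under this assumption, setting the blend weight to zero everywhere renders the InD field alone, which is one way to picture OOD object removal; editing only the InD densities and colors leaves the OOD contribution untouched.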
Abstract
3D-aware GANs offer new capabilities for view synthesis while preserving the editing functionalities of their 2D counterparts. GAN inversion is a crucial step that seeks the latent code to reconstruct input images or videos, subsequently enabling diverse editing tasks through manipulation of this latent code. However, a model pre-trained on a particular dataset (e.g., FFHQ) often has difficulty reconstructing images with out-of-distribution (OOD) objects such as faces with heavy makeup or occluding objects. We address this issue by explicitly modeling OOD objects from the input in 3D-aware GANs. Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. The final reconstruction is achieved by optimizing the composition of these two radiance fields with carefully designed regularization. We demonstrate that our explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate the reconstruction accuracy and editability of our method on challenging real face images and videos and show favorable results compared with the baselines.
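The inversion problem described in the abstract can be illustrated with a toy example: search for the latent code whose generated output best matches a target. The sketch below uses a tiny linear map as a stand-in generator, which is purely an assumption for illustration and not the 3D-aware GAN used in the paper; `generate`, `invert`, and `basis` are hypothetical names.

```python
def generate(latent, basis):
    """Toy stand-in generator (ASSUMPTION): a linear map from latent to output."""
    return [sum(b * z for b, z in zip(row, latent)) for row in basis]

def invert(target, basis, steps=500, lr=0.05):
    """Optimization-based inversion: gradient descent on squared error."""
    latent = [0.0] * len(basis[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(latent, basis), target)]
        # Gradient of sum(residual^2) with respect to each latent entry.
        grad = [2.0 * sum(r * row[k] for r, row in zip(residual, basis))
                for k in range(len(latent))]
        latent = [z - lr * g for z, g in zip(latent, grad)]
    residual = [g - t for g, t in zip(generate(latent, basis), target)]
    loss = sum(r * r for r in residual)
    return latent, loss

# The toy generator can only produce outputs of the form [z, 2z].
basis = [[1.0], [2.0]]
_, loss_ind = invert([1.0, 2.0], basis)  # target the generator can produce
_, loss_ood = invert([1.0, 0.0], basis)  # target outside the generator's range
print(loss_ind, loss_ood)
```

In this toy setting the in-range target is reconstructed almost exactly, while the out-of-range target leaves a residual that no latent code can remove. That residual is a loose analogue of the OOD content that the method assigns to a second radiance field.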
