Improving generative adversarial network inversion via fine-tuning GAN encoders

Cheng Yu, Wenmin Wang, Roberto Bugiolacchi

TL;DR

The paper tackles the challenge of inverting real images across diverse GAN architectures, where existing methods largely specialize in StyleGAN. It proposes self-supervised pre-training of encoders with an adaptive block and cropping-attention losses, followed by fine-tuning on real images with latent regularization. Across StyleGAN, PGGAN, and BigGAN, the approach outperforms state-of-the-art baselines in inversion quality and enables real-face editing. Ablation studies confirm the importance of the cropping attentions and SSIM, while noting limitations in preserving fine accessories due to current generator capabilities.

Abstract

Generative adversarial networks (GANs) can synthesize high-quality (HQ) images, and GAN inversion is a technique that inverts given images back to latent space. While existing methods target StyleGAN inversion, they have limited performance and do not generalize to different GANs. To address these issues, we propose a self-supervised method to pre-train and fine-tune GAN encoders. First, we design an adaptive block to fit different encoder architectures for inverting diverse GANs. Then we pre-train GAN encoders using synthesized images and emphasize local regions by cropping images. Finally, we fine-tune the pre-trained GAN encoder for inverting real images. Compared with state-of-the-art methods, our method achieves better results, reconstructing high-quality images on mainstream GANs. Our code and pre-trained models are available at: https://github.com/disanda/Deep-GAN-Encoders.

Paper Structure

This paper contains 15 sections, 6 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Overview of GAN inversion. Inverting real images to latent vectors is called GAN inversion (indicated by black arrow). Using the latent vectors as input to the generator of a pre-trained GAN, we can generate reconstructions (indicated by black arrows).
  • Figure 2: The 1st row displays face images synthesized by StyleGAN2 (FFHQ, 1024×1024). In the 2nd row, we present our reconstructions. The 3rd row shows three real faces (on the left) and their reconstructions by our method (on the right). Our method accurately reproduces the original faces. The 4th row demonstrates the ability of our method to edit faces along five interpretable latent directions using RFM: mouth, eyeglasses, younger, older, and pose.
  • Figure 3: The inversion data flow for StyleGANs consists of 3 steps. In Step 1, we input latent vectors ($\mathbf{w}$, $\mathbf{z_c}$) and their inverted vectors ($\mathbf{w'}$, $\mathbf{z'_c}$) to $G$ to synthesize and reconstruct images ($\mathbf{x}$, $\mathbf{x'}$). In Step 2, we crop $\mathbf{x}$ and $\mathbf{x'}$ to obtain two attention areas that highlight the main objects in the images. In Step 3, we train $E$ to invert images back to latent vectors.
  • Figure 4: Overview of the adaptive block architecture for the GAN encoder. NORM denotes the normalization layer, and AC denotes the activation layer. FC and the learnable $\mathbf{z}_n'$ are removed for PGGAN and BigGAN. We have also added layer-wise label vectors $c$ for BigGAN inversion.
  • Figure 5: Cropping attention to center-aligned faces. The gray dashed line indicates the boundaries of the original image. Empirically, the black dashed line crops the first attention (AT1) around 0.75$\%$ width, and the red dashed line crops the second attention (AT2) around 0.69$\%$ width and height from the original image.
  • ...and 11 more figures
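
The center-aligned cropping described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `center_crop_box` and `center_crop` are hypothetical, and the code assumes the caption's values (0.75 and 0.69) are fractions of the image side, with AT1 cropping the width only and AT2 cropping both width and height.

```python
def center_crop_box(height, width, frac_h, frac_w):
    """Return (top, left, bottom, right) of a centered crop.

    frac_h and frac_w are the fractions of the height and width to keep.
    Hypothetical helper; the fractions are assumptions based on Figure 5.
    """
    if not (0 < frac_h <= 1 and 0 < frac_w <= 1):
        raise ValueError("fractions must be in (0, 1]")
    crop_h = max(1, round(height * frac_h))
    crop_w = max(1, round(width * frac_w))
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w


def center_crop(image, frac_h, frac_w):
    """Crop a 2-D image, given as a list of rows, around its center."""
    height, width = len(image), len(image[0])
    top, left, bottom, right = center_crop_box(height, width, frac_h, frac_w)
    return [row[left:right] for row in image[top:bottom]]


# Assumed reading of Figure 5 for a 1024x1024 image:
# AT1 keeps about 0.75 of the width, AT2 about 0.69 of width and height.
at1_box = center_crop_box(1024, 1024, 1.0, 0.75)
at2_box = center_crop_box(1024, 1024, 0.69, 0.69)
```

Applying the same box to both the synthesized image and its reconstruction keeps the two attention regions aligned, so that a loss computed on the crops compares corresponding pixels.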