Disentangled Inference for GANs with Latently Invertible Autoencoder

Jiapeng Zhu, Deli Zhao, Bo Zhang, Bolei Zhou

TL;DR

The paper tackles the critical problem of enabling real-image inference for GANs by addressing latent-space entanglement that hampers encoder learning. It introduces Latently Invertible Autoencoder (LIA), a framework that embeds an invertible mapping between disentangled latent spaces in a two-stage training regime, allowing accurate reconstruction and efficient inference. Empirical results on FFHQ and LSUN demonstrate improved reconstruction quality and versatile image-editing capabilities, while ablations show the necessity of the disentangled $\bm y$-space and the invertible bridge. The approach offers a practical path to GAN inversion and editing for real images, with implications for data augmentation, few-shot learning, and 3D vision tasks.

Abstract

Generative Adversarial Networks (GANs) play an increasingly important role in machine learning. However, one fundamental issue hinders their practical application: they cannot encode real-world samples. The conventional way of addressing this issue is to learn an encoder for the GAN via a Variational Auto-Encoder (VAE). In this paper, we show that the entanglement of the latent space in the VAE/GAN framework poses the main challenge for encoder learning. To address the entanglement issue and enable inference in GANs, we propose a novel algorithm named Latently Invertible Autoencoder (LIA). In the LIA framework, an invertible network and its inverse mapping are symmetrically embedded in the latent space of a VAE. The decoder of LIA is first trained as a standard GAN with the invertible network, and then the partial encoder is learned from a disentangled autoencoder by detaching the invertible network from LIA, thus avoiding the entanglement problem caused by the random latent space. Experiments conducted on the FFHQ face dataset and three LSUN datasets validate the effectiveness of LIA/GAN.
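To make the role of the invertible network concrete, here is a minimal NumPy sketch of a mapping $\phi$ with an exact inverse $\phi^{-1}$. It is an illustration only: the use of additive coupling layers, the class names `AdditiveCoupling` and `InvertibleNet`, the layer count, the hidden width, and the random untrained weights are all assumptions made for this example and are not taken from the text above.

```python
import numpy as np


class AdditiveCoupling:
    """One additive coupling layer: half of the vector is shifted by a
    function of the other half, so the layer is invertible by subtraction."""

    def __init__(self, dim, hidden, rng, flip=False):
        assert dim % 2 == 0, "this sketch assumes an even dimension"
        self.h = dim // 2
        self.flip = flip  # which half is transformed
        # Random weights of a tiny shift network (untrained, illustration only).
        self.w1 = rng.normal(scale=0.5, size=(self.h, hidden))
        self.w2 = rng.normal(scale=0.5, size=(hidden, self.h))

    def _shift(self, v):
        return np.tanh(v @ self.w1) @ self.w2

    def forward(self, y):
        a, b = y[:, :self.h], y[:, self.h:]
        if self.flip:
            return np.concatenate([a + self._shift(b), b], axis=1)
        return np.concatenate([a, b + self._shift(a)], axis=1)

    def inverse(self, z):
        a, b = z[:, :self.h], z[:, self.h:]
        if self.flip:
            return np.concatenate([a - self._shift(b), b], axis=1)
        return np.concatenate([a, b - self._shift(a)], axis=1)


class InvertibleNet:
    """Stack of coupling layers: forward is phi (y -> z), inverse is phi^{-1} (z -> y)."""

    def __init__(self, dim, n_layers=4, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [
            AdditiveCoupling(dim, hidden, rng, flip=(i % 2 == 1))
            for i in range(n_layers)
        ]

    def forward(self, y):
        for layer in self.layers:
            y = layer.forward(y)
        return y

    def inverse(self, z):
        # Undo the layers in reverse order.
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z


phi = InvertibleNet(dim=8)
z = np.random.default_rng(1).normal(size=(5, 8))  # samples from a Gaussian prior
y = phi.inverse(z)       # y = phi^{-1}(z): the vector a decoder would consume
z_back = phi.forward(y)  # z = phi(y): maps the feature back to the prior space
print(np.allclose(z, z_back))
```

Because every layer is undone exactly by subtraction, the round trip between the two latent spaces loses no information. That property is what allows the invertible network to be detached while the encoder is trained and reattached afterwards.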

Paper Structure

This paper contains 23 sections, 17 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Illustration of disentanglement via the Swiss-roll manifold. The Swiss roll in (c) is obtained by applying the roll-shaped mapping to the coordinates in (b), so the blue line in (b) corresponds to the shortest blue path (the geodesic on the Swiss roll) in (c). The paths in (e) are the same as those in (c). The red path in (c) and the blue path in (d) are manually shaped for comparison. There are no explicit functions for $\varphi$ and $\phi$ here, and the shape of the $\bm y$-space cannot be computed directly from the $\bm z$-space in this figure, so they are plotted for illustration only.
  • Figure 2: Illustration of the disentanglement in latent spaces of GANs. (a) and (b) show the interpolation results between two images in the $\bm z$-space (red) and the $\bm y$-space (blue), respectively. For (c) and (d), we randomly sample 4,000 faces (including the two faces used in (a) and (b)) as the face domain (gray dots), which is embedded using t-SNE (tSNE2008), and then find the two images' interpolated paths (red/blue curves) in the $\bm z$-space and the $\bm y$-space, respectively. The bottom-left window in (c) shows a zoomed-out view of the entire $\bm z$ domain, since the whole domain is very sparse.
  • Figure 3: Latently invertible autoencoder (LIA) with adversarial learning. (a) LIA consists of five functional modules: an encoder $f$ that extracts features $\bm y = f(\bm x)$; an invertible network $\phi$ that reshapes feature embeddings to match the prior distribution, $\bm z = \phi(\bm y)$, with its inverse $\phi^{-1}$ mapping latent variables to disentangled feature vectors, $\bm y = \phi^{-1}(\bm z)$; a decoder $g$ that produces the output $\tilde{\bm x} = g(\tilde{\bm y})$; a feature extractor $\epsilon$ used to measure reconstruction quality; and a discriminator $c$ that distinguishes real from fake distributions. LIA is trained in two stages: (b) the decoder is trained first via a GAN model, and (c) the encoder is then trained with the invertible network detached from LIA. The parameters of the modules shown in dark gray in (c) are frozen in this stage.
  • Figure 4: Comparison of different methods on face reconstruction. The first row shows original images from the FFHQ dataset, and the remaining rows show the results of the compared methods. LIA produces better results in terms of image quality and reconstruction accuracy: age, hat, and pose are all well preserved. ALI is not suitable in this scenario because it conveys high-level semantic information, which is more useful for recognition.
  • Figure 5: Example real images of objects and scenes from the LSUN validation set and their reconstructions by LIA. Three categories are tested: bedroom, cat, and car.
  • ...and 7 more figures
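
The data flow described in the Figure 3 caption can be summarized as a worked set of equations. This is a sketch that only restates the caption: the prior notation $p(\bm z)$ is an assumption introduced here, no loss functions are given because the caption does not state them, and which modules are frozen in the second stage is indicated only in the figure itself.

```latex
\begin{aligned}
\text{Stage 1 (decoder via GAN):}\quad
  & \bm z \sim p(\bm z), \qquad \bm y = \phi^{-1}(\bm z), \qquad \tilde{\bm x} = g(\bm y),
  \qquad c \text{ distinguishes } \bm x \text{ from } \tilde{\bm x} \\
\text{Stage 2 (encoder, } \phi \text{ detached):}\quad
  & \tilde{\bm y} = f(\bm x), \qquad \tilde{\bm x} = g(\tilde{\bm y}),
  \qquad \text{reconstruction measured via } \epsilon(\bm x) \text{ and } \epsilon(\tilde{\bm x}) \\
\text{Inference on a real image:}\quad
  & \bm z = \phi\bigl(f(\bm x)\bigr)
\end{aligned}
```

In the second stage the encoder is fit directly against $\bm y$-space features, so the random $\bm z$-space never enters encoder learning. The latent code $\bm z$ is recovered afterwards by reattaching $\phi$.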