Deep Feature Consistent Variational Autoencoder

Xianxu Hou, Linlin Shen, Ke Sun, Guoping Qiu

TL;DR

This work replaces the pixel-wise reconstruction loss in Variational Autoencoders (VAEs) with a deep feature perceptual loss computed from a fixed, pre-trained CNN, improving the perceptual quality of generated faces while preserving latent-space structure. The authors design a deep CNN-based VAE with a 4-layer encoder and decoder and a VGG-based loss network, and train it with KL regularization plus feature losses summed over several layers (relu1_2, relu2_1, and relu3_1). Experiments on CelebA show clearer reconstructions than a plain VAE and results comparable to or better than DCGAN in some aspects. The latent space supports smooth interpolation, attribute manipulation by vector arithmetic, and facial attribute prediction (86.95% average accuracy). The latent representations also capture semantic attributes and their correlations, which enables analyses such as attribute-correlation studies and t-SNE visualizations.

Abstract

We present a novel method for constructing a Variational Autoencoder (VAE). Instead of using a pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of a VAE, which ensures that the VAE's output preserves the spatial correlation characteristics of the input, thus giving the output a more natural visual appearance and better perceptual quality. Based on recent deep learning works such as style transfer, we employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training. Evaluated on the CelebA face dataset, we show that our model produces better results than other methods in the literature. We also show that our method can produce latent vectors that capture the semantic information of facial expressions and can be used to achieve state-of-the-art performance in facial attribute prediction.
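The training objective described above (a KL regularization term plus a feature perceptual loss summed over several layers of a fixed network) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-layer normalization, the weights `alpha` and `beta`, and the `toy_extract` function (a stand-in for the hidden activations of a pre-trained CNN) are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def kl_divergence(mu: Vector, log_var: Vector) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def feature_loss(feats_x: Sequence[Vector], feats_y: Sequence[Vector]) -> float:
    """Sum over layers of half the mean squared feature difference.

    The 1/2 factor and per-layer averaging are assumed here for illustration.
    """
    total = 0.0
    for fx, fy in zip(feats_x, feats_y):
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(fx, fy)) / len(fx)
    return total


def total_loss(
    x: Vector,
    x_rec: Vector,
    mu: Vector,
    log_var: Vector,
    extract: Callable[[Vector], List[List[float]]],
    alpha: float = 1.0,  # hypothetical weight on the KL term
    beta: float = 1.0,   # hypothetical weight on the feature term
) -> float:
    """Weighted sum of the KL term and the multi-layer feature loss."""
    return alpha * kl_divergence(mu, log_var) + beta * feature_loss(extract(x), extract(x_rec))


def toy_extract(image: Vector) -> List[List[float]]:
    """Hypothetical stand-in for a fixed loss network: two 'layers' of features."""
    layer1 = list(image)
    layer2 = [(image[i] + image[i + 1]) / 2.0 for i in range(len(image) - 1)]
    return [layer1, layer2]
```

In practice the extractor would return activations from a frozen pre-trained network, and only the encoder and decoder would receive gradient updates.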

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Model overview. Left: a deep CNN-based Variational Autoencoder. Right: a pre-trained deep CNN used to compute the feature perceptual loss.
  • Figure 2: Autoencoder network architecture. Left: encoder network. Right: decoder network.
  • Figure 3: Fake face images generated by different models from a 100-dimensional latent vector $z \sim \mathcal{N}(0, 1)$. The first part is generated by the decoder network of a plain variational autoencoder (PVAE) trained with a pixel-based loss (Kingma and Welling, 2013), the second part by the generator network of DCGAN (Radford et al., 2015), and the third part by our method trained with the feature perceptual loss.
  • Figure 4: Image reconstruction by different models. The first row shows the input images, the second row shows the output of the decoder network of a plain variational autoencoder (PVAE) trained with a pixel-based loss (Kingma and Welling, 2013), and the last row shows our method trained with the feature perceptual loss.
  • Figure 5: Linear interpolation between latent vectors. Each row interpolates from a left latent vector $z_{left}$ to a right latent vector $z_{right}$, i.e. $(1-\alpha) z_{left} + \alpha z_{right}$. The first row transitions from a non-smiling woman to a smiling woman, the second row from a man without sunglasses to a man with sunglasses, the third row from a man to a woman, and the last row between two fake faces decoded from $z \sim \mathcal{N}(0, 1)$.
  • ...and 3 more figures
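
The linear interpolation in the Figure 5 caption, $(1-\alpha) z_{left} + \alpha z_{right}$, can be sketched as below. The function name `interpolate` and the choice of evenly spaced $\alpha$ values in $[0, 1]$ are assumptions for illustration; the caption does not state how many steps or which $\alpha$ values were used.

```python
from typing import List, Sequence


def interpolate(z_left: Sequence[float], z_right: Sequence[float], steps: int) -> List[List[float]]:
    """Return `steps` latent vectors moving linearly from z_left to z_right.

    Uses evenly spaced alpha values in [0, 1] (an assumed schedule).
    """
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    if len(z_left) != len(z_right):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for i in range(steps):
        alpha = i / (steps - 1)
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(z_left, z_right)])
    return path
```

Each vector in the returned path would then be passed through the decoder to produce one image in the row.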