Stacked Generative Adversarial Networks

Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, Serge Belongie

TL;DR

SGAN introduces a top-down stack of GANs that inverts a pre-trained discriminative encoder by enforcing adversarial alignment of intermediate representations through representation discriminators. It adds a conditional loss to preserve higher-level conditioning and an entropy loss to promote diverse outputs via a variational lower bound on $H(\hat{h}_i \mid h_{i+1})$. Training proceeds from independent per-stack objectives to end-to-end joint optimization, enabling hierarchical decomposition of variation and conditioning on class labels. On MNIST, SVHN, and CIFAR-10, SGAN achieves higher image quality and diversity than vanilla GAN variants, with state-of-the-art Inception scores on CIFAR-10 and strong human-perceived realism in Visual Turing Tests. The work demonstrates that leveraging hierarchical discriminative representations can substantially improve generative modeling while enhancing interpretability through multi-level latent structure.

Abstract

In this paper, we propose a novel generative model named Stacked Generative Adversarial Networks (SGAN), which is trained to invert the hierarchical representations of a bottom-up discriminative network. Our model consists of a top-down stack of GANs, each trained to generate lower-level representations conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, leveraging the powerful discriminative representations to guide the generative model. In addition, we introduce a conditional loss that encourages the use of conditional information from the layer above, and a novel entropy loss that maximizes a variational lower bound on the conditional entropy of generator outputs. We first train each stack independently, and then train the whole model end-to-end. Unlike the original GAN, which uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the top-down generative process. Based on visual inspection, Inception scores, and a visual Turing test, we demonstrate that SGAN is able to generate images of much higher quality than GANs without stacking.
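
The three loss terms named in the abstract can be combined per stack roughly as follows. This is a minimal NumPy sketch, not the authors' implementation. It assumes a non-saturating adversarial term, a squared Euclidean conditional term, and an entropy term computed as the squared error of an auxiliary predictor that tries to recover the noise $z_i$ from the generated features (which corresponds to a fixed-variance Gaussian choice for the variational distribution). The function name, argument names, and the `adv_weight` parameter are hypothetical.

```python
import numpy as np

def generator_loss(d_fake, h_cond, h_recovered, z, z_predicted, adv_weight=1.0):
    """Hypothetical per-stack generator loss: adversarial + conditional + entropy.

    d_fake      : discriminator's probability that each generated feature is real, shape (N,)
    h_cond      : higher-level features h_{i+1} given to the generator, shape (N, K)
    h_recovered : encoder output when applied to the generated features, shape (N, K)
    z           : noise given to the generator, shape (N, Z)
    z_predicted : noise recovered from the generated features (assumed auxiliary net), shape (N, Z)
    """
    eps = 1e-8  # avoids log(0)
    # Adversarial term (assumed non-saturating form): push d_fake toward 1.
    adv = -np.mean(np.log(d_fake + eps))
    # Conditional term (assumed squared Euclidean distance): generated features,
    # once re-encoded, should reproduce the conditioning features.
    cond = np.mean(np.sum((h_recovered - h_cond) ** 2, axis=1))
    # Entropy term (assumed fixed-variance Gaussian): the noise must stay
    # recoverable from the output, so the generator cannot ignore it.
    ent = np.mean(np.sum((z_predicted - z) ** 2, axis=1))
    total = adv_weight * adv + cond + ent
    return total, {"adv": adv, "cond": cond, "ent": ent}

# Usage with random placeholder values.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
z = rng.standard_normal((4, 5))
total, parts = generator_loss(np.full(4, 0.5), h, h, z, z)
```

In this usage the conditional and entropy terms are exactly zero because the placeholder inputs match, so only the adversarial term contributes to `total`.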

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: An overview of SGAN. (a) The original GAN of Goodfellow et al. (2014). (b) The workflow of training SGAN, where each generator $G_i$ tries to generate plausible features that can fool the corresponding representation discriminator $D_i$. Each generator receives conditional input from encoders in the independent training stage, and from the upper generators in the joint training stage. (c) New images can be sampled from SGAN (during test time) by feeding random noise to each generator $G_i$.
  • Figure 2: MNIST results. (a) Samples generated by SGAN when conditioned on class labels. (b) Corresponding nearest neighbor images in the training set. (c) Samples generated by the bottom GAN when conditioned on a fixed fc3 feature activation, generated by the top GAN. (d) Same as (c), but the bottom GAN is trained without entropy loss.
  • Figure 3: SVHN results. (a) Samples generated by SGAN when conditioned on class labels. (b) Corresponding nearest neighbor images in the training set. (c) Samples generated by the bottom GAN when conditioned on a fixed fc3 feature activation, generated by the top GAN. (d) Same as (c), but the bottom GAN is trained without entropy loss.
  • Figure 4: CIFAR-10 results. (a) Samples generated by SGAN when conditioned on class labels. (b) Corresponding nearest neighbor images in the training set. (c) Samples generated by the bottom GAN when conditioned on a fixed fc3 feature activation, generated by the top GAN. (d) Same as (c), but the bottom GAN is trained without entropy loss.
  • Figure 5: Ablation studies on CIFAR-10. Samples from (a) full SGAN; (b) SGAN without joint training; (c) DCGAN trained with $\mathcal{L}^{adv}+\mathcal{L}^{cond}+\mathcal{L}^{ent}$; (d) DCGAN trained with $\mathcal{L}^{adv}+\mathcal{L}^{cond}$; (e) DCGAN trained with $\mathcal{L}^{adv}+\mathcal{L}^{ent}$; (f) DCGAN trained with $\mathcal{L}^{adv}$.
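
The test-time sampling procedure described for Figure 1(c) can be sketched as a loop that runs the generator stack top-down and draws fresh noise at every level. The generators here are hypothetical stand-ins (one random linear layer followed by tanh), not trained networks. The two-level layout and the dimensions (10 classes, 50-dimensional noise, 256-dimensional intermediate features, 784-dimensional output) are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_generator(cond_dim, noise_dim, out_dim):
    """Hypothetical stand-in for a trained G_i: a random linear layer plus tanh."""
    w = rng.standard_normal((cond_dim + noise_dim, out_dim)) * 0.1

    def generate(h_above, z):
        # Condition on the representation from the level above and on noise z.
        return np.tanh(np.concatenate([h_above, z], axis=1) @ w)

    return generate

def sample(generators, noise_dims, labels, num_classes):
    """Run the stack top-down: class label -> features -> image."""
    h = np.eye(num_classes)[labels]  # one-hot labels act as the top-level condition
    for g, noise_dim in zip(generators, noise_dims):
        z = rng.standard_normal((h.shape[0], noise_dim))  # fresh noise per level
        h = g(h, z)
    return h

# Illustrative two-level stack (all sizes are assumptions).
noise_dims = [50, 50]
generators = [make_generator(10, 50, 256), make_generator(256, 50, 784)]
images = sample(generators, noise_dims, labels=np.array([3, 3, 7]), num_classes=10)
print(images.shape)  # (3, 784)
```

Because each level draws its own noise, two samples with the same label (the first two rows above) still differ, which mirrors the idea of resolving variation level by level rather than through a single noise vector.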