Fisher GAN

Youssef Mroueh, Tom Sercu

TL;DR

The paper tackles GAN training instability by introducing Fisher IPM, a scale-invariant distribution distance achieved by constraining the critic's second-order moments in a data-dependent way. By interpreting the neural-network-parameterized critic as whitening mean embeddings, Fisher IPM yields a Mahalanobis-distance-based discrepancy that remains computationally efficient and avoids aggressive weight clipping or costly gradient penalties. The authors prove that, with full function capacity, Fisher IPM equals the Chi-squared distance, derive a practical ALM-based optimization algorithm for training, and provide generalization bounds for the learned critic; they validate the approach with stable training, fast convergence, and competitive semi-supervised results on standard benchmarks.

Abstract

Generative Adversarial Networks (GANs) are powerful models for learning complex distributions. Stable training of GANs has been addressed in many recent works which explore different metrics between distributions. In this paper we introduce Fisher GAN which fits within the Integral Probability Metrics (IPM) framework for training GANs. Fisher GAN defines a critic with a data dependent constraint on its second order moments. We show in this paper that Fisher GAN allows for stable and time efficient training that does not compromise the capacity of the critic, and does not need data independent constraints such as weight clipping. We analyze our Fisher IPM theoretically and provide an algorithm based on Augmented Lagrangian for Fisher GAN. We validate our claims on both image sample generation and semi-supervised classification using Fisher GAN.
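
The data-dependent second-moment constraint and the Augmented Lagrangian mentioned above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the `tanh` critic, the Gaussian stand-ins for $\mathbb{P}$ and $\mathbb{Q}$, the function names, and the exact form of the Lagrangian ($\hat{\mathcal{E}} + \lambda(1-\hat{\Omega}) - \frac{\rho}{2}(1-\hat{\Omega})^2$) are assumptions made for illustration, since this excerpt does not spell them out.

```python
import numpy as np

def critic_moments(f_real, f_fake):
    """Mean gap E_hat and pooled second moment Omega_hat of critic outputs."""
    e_hat = f_real.mean() - f_fake.mean()
    omega_hat = 0.5 * np.mean(f_real ** 2) + 0.5 * np.mean(f_fake ** 2)
    return e_hat, omega_hat

def fisher_ratio(f_real, f_fake):
    """Mean gap normalized by the root of the pooled second moment."""
    e_hat, omega_hat = critic_moments(f_real, f_fake)
    return e_hat / np.sqrt(omega_hat)

def augmented_lagrangian(f_real, f_fake, lam, rho):
    """Assumed ALM objective: E_hat + lam * (1 - Omega_hat) - rho/2 * (1 - Omega_hat)^2."""
    e_hat, omega_hat = critic_moments(f_real, f_fake)
    slack = 1.0 - omega_hat
    return e_hat + lam * slack - 0.5 * rho * slack ** 2

rng = np.random.default_rng(0)
x_real = rng.normal(loc=1.0, size=1000)    # stand-in for samples from P
x_fake = rng.normal(loc=-1.0, size=1000)   # stand-in for generator samples

# Hypothetical fixed critic; in a GAN this would be a trained network.
f_real, f_fake = np.tanh(x_real), np.tanh(x_fake)

base = fisher_ratio(f_real, f_fake)
scaled = fisher_ratio(10.0 * f_real, 10.0 * f_fake)   # rescaled critic

# Rescale the critic so that Omega_hat = 1; the penalty terms then vanish.
_, omega_hat = critic_moments(f_real, f_fake)
scale = 1.0 / np.sqrt(omega_hat)
on_constraint = augmented_lagrangian(scale * f_real, scale * f_fake, lam=0.3, rho=1e-6)
```

Rescaling the critic leaves the normalized ratio unchanged, which is the sense in which the distance is scale-invariant. When the constraint holds exactly, the assumed Lagrangian reduces to the plain mean gap, whatever the values of `lam` and `rho`.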

Paper Structure

This paper contains 22 sections, 5 theorems, 115 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Consider the Fisher IPM for $\mathcal{F}$ being the space of all measurable functions endowed by $\frac{1}{2}(\mathbb{P}+\mathbb{Q})$, i.e. $\mathcal{F}:=\mathcal{L}_{2}(\pazocal{X},\frac{\mathbb{P}+\mathbb{Q}}{2})$. Define the Chi-squared distance between two distributions: The following holds true for any $\mathbb{P},\mathbb{Q}$, $\mathbb{P}\neq \mathbb{Q}$: 1) The Fisher IPM for $\mathcal{F}=\

Figures (7)

  • Figure 1: Illustration of Fisher IPM with Neural Networks. $\Phi_\omega$ is a convolutional neural network which defines the embedding space. $v$ is the direction in this embedding space with maximal mean separation $\left\langle{v},{\mu_{\omega}(\mathbb{P})-\mu_{\omega}(\mathbb{Q})}\right\rangle$, constrained by the hyperellipsoid $v^\top \, \Sigma_{\omega}(\mathbb{P};\mathbb{Q}) \, v = 1$.
  • Figure 2: Example on 2D synthetic data, where both $\textcolor{blue}{\mathbb{P}}$ and $\textcolor{red}{\mathbb{Q}}$ are fixed normal distributions with the same covariance and shifted means along the x-axis, see (a). Panels (b, c) show the exact $\chi_2$ distance from numerically integrating Eq. (\ref{eq:chi2}), together with the estimate obtained from training a 5-layer MLP with layer size 16 and LeakyReLU nonlinearity on different training sample sizes. The MLP is trained using Algorithm 1, where sampling from the generator is replaced by sampling from $\mathbb{Q}$, and the $\chi_2$ MLP estimate is computed with Eq. (\ref{eq:FisherIPM}) on a large number of samples (i.e. an out-of-sample estimate). We see in (b) that for large enough sample size, the MLP estimate is extremely good. In (c) we see that for smaller sample sizes, the MLP approximation bounds the ground truth $\chi_2$ from below (see Theorem 2) and converges to the ground truth roughly as $\mathcal{O}(\frac{1}{\sqrt{N}})$ (Theorem 3). We notice that when the distributions have small $\chi_2$ distance, a larger training size is needed to get a better estimate; again, this is in line with Theorem 3.
  • Figure 3: Samples and plots of the loss $\hat{\mathcal{E}}(.)$, lagrange multiplier $\lambda$, and constraint $\hat{\Omega}(.)$ on 3 benchmark datasets. We see that during training as $\lambda$ grows slowly, the constraint becomes tight.
  • Figure 4: No Batch Norm: Training results from a critic $f$ without batch normalization. Fisher GAN (left) produces decent samples, while WGAN with weight clipping (right) does not. We hypothesize that this is due to the implicit whitening that Fisher GAN provides. (Note that WGAN-GP also converges successfully without BN [gulrajani2017improved].) For both models the learning rate was appropriately reduced.
  • Figure 5: CIFAR-10 inception scores under 3 training conditions. Corresponding samples are given in rows from top to bottom (a,b,c). The inception score plots mirror Figure 3 from [gulrajani2017improved]. Note: In v1 of this paper, the baseline inception scores were underestimated because they were computed using too few samples. Note: All inception scores are computed from the same tensorflow codebase, using the architecture described in Appendix \ref{sec:mdl2}, and with weight initialization from a normal distribution with stdev=0.02. In Appendix \ref{appendix:inception} we show that these choices also benefit our WGAN-GP baseline.
  • ...and 2 more figures
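
The geometry in Figure 1 can be checked numerically for a fixed embedding. The sketch below is an illustration under stated assumptions, not the paper's code: the random Gaussian arrays stand in for embeddings $\Phi_\omega(x)$, and $\Sigma_{\omega}(\mathbb{P};\mathbb{Q})$ is assumed to be the pooled uncentered second-moment matrix of those embeddings. Under that assumption, the maximizer of $\langle v, \mu_\omega(\mathbb{P})-\mu_\omega(\mathbb{Q})\rangle$ on the hyperellipsoid is proportional to $\Sigma^{-1}(\mu_\omega(\mathbb{P})-\mu_\omega(\mathbb{Q}))$, and the optimal value is a Mahalanobis-type distance between the mean embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical embeddings Phi(x) for samples from P and Q (assumed data).
phi_p = rng.normal(size=(2000, dim)) + 0.5
phi_q = rng.normal(size=(2000, dim))

# Difference of mean embeddings.
delta = phi_p.mean(axis=0) - phi_q.mean(axis=0)

# Assumed Sigma: average of the two uncentered second-moment matrices.
sigma = 0.5 * (phi_p.T @ phi_p) / len(phi_p) + 0.5 * (phi_q.T @ phi_q) / len(phi_q)

# Closed form: v* is Sigma^{-1} delta rescaled onto the ellipsoid v^T Sigma v = 1.
whitened = np.linalg.solve(sigma, delta)
value = np.sqrt(delta @ whitened)     # Mahalanobis-type distance
v_star = whitened / value

# Sanity check: random directions on the same ellipsoid separate the means less.
best_random = -np.inf
for _ in range(1000):
    v = rng.normal(size=dim)
    v /= np.sqrt(v @ sigma @ v)       # project onto v^T Sigma v = 1
    best_random = max(best_random, v @ delta)
```

Here `v_star @ delta` equals `value`, and no sampled direction on the hyperellipsoid exceeds it, which follows from the Cauchy-Schwarz inequality in the $\Sigma$-weighted inner product.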

Theorems & Definitions (9)

  • Theorem 1: Chi-squared distance at full capacity
  • Theorem 2: Approximating Chi-squared distance in an arbitrary function space $\mathcal{H}$
  • Remark 1
  • Proof of Theorem 1
  • Proof of Theorem 2
  • Theorem 3
  • Proof of Theorem 3
  • Lemma 1: Bounds with (Local) Rademacher Complexity [IPMemp, bartlett2005]
  • Lemma 2: Contraction Lemma [bartlett2005]
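
The statement of Theorem 1 (Fisher IPM equals the Chi-squared distance at full capacity) can be illustrated on a finite sample space, where every function is a vector and the supremum can be checked directly. This is a sketch under assumptions: the three-point distributions are made up, and the Chi-squared distance is taken to be the symmetric form $\sqrt{\sum_x (p(x)-q(x))^2 / \tfrac{p(x)+q(x)}{2}}$, which is not reproduced in this excerpt.

```python
import numpy as np

# Two made-up distributions on a three-point sample space.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
m = 0.5 * (p + q)                      # mixture (P + Q) / 2

# Assumed symmetric Chi-squared distance (square-root form).
chi2 = np.sqrt(np.sum((p - q) ** 2 / m))

def fisher_objective(f):
    """Mean gap of f under p and q, normalized by its second moment under m."""
    return np.sum(f * (p - q)) / np.sqrt(np.sum(f ** 2 * m))

# Candidate optimal critic: the density difference relative to the mixture.
f_star = (p - q) / m

# Random critics should never beat the candidate optimum.
rng = np.random.default_rng(0)
random_best = max(fisher_objective(rng.normal(size=3)) for _ in range(2000))
```

The candidate critic `f_star` attains `chi2` exactly, and the random critics stay at or below it. That is the finite-dimensional analogue of the full-capacity claim under the assumed definition.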