Probabilistic Matching of Real and Generated Data Statistics in Generative Adversarial Networks

Philipp Pilar, Niklas Wahlström

TL;DR

A method to ensure that the distributions of certain generated data statistics coincide with the respective distributions of the real data, by adding a new loss term to the generator loss function, which quantifies the difference between these distributions via suitable f-divergences.

Abstract

Generative adversarial networks constitute a powerful approach to generative modeling. While generated samples often are indistinguishable from real data, there is no guarantee that they will follow the true data distribution. For scientific applications in particular, it is essential that the true distribution is well captured by the generated distribution. In this work, we propose a method to ensure that the distributions of certain generated data statistics coincide with the respective distributions of the real data. In order to achieve this, we add a new loss term to the generator loss function, which quantifies the difference between these distributions via suitable f-divergences. Kernel density estimation is employed to obtain representations of the true distributions, and to estimate the corresponding generated distributions from minibatch values at each iteration. When compared to other methods, our approach has the advantage that the complete shapes of the distributions are taken into account. We evaluate the method on a synthetic dataset and a real-world dataset and demonstrate improved performance of our approach.
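The extra loss term described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the evaluation grid, the kernel widths, the toy Gaussian data, and the direction of the KL divergence are all assumptions made for illustration. In actual GAN training the same computation would have to be written in an automatic-differentiation framework so that gradients reach the generator.

```python
import numpy as np

def gaussian_kde_on_grid(values, grid, sigma):
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    diffs = (grid[:, None] - values[None, :]) / sigma
    kernels = np.exp(-0.5 * diffs**2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

def kl_divergence(p, q, grid, eps=1e-12):
    """Approximate KL(p || q) on a uniform grid with a Riemann sum.

    KL is one example of an f-divergence; the direction used here is an assumption.
    """
    dx = grid[1] - grid[0]
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx)

rng = np.random.default_rng(0)

# Toy stand-ins (assumed): one scalar statistic per sample.
real_stat = rng.normal(0.0, 1.0, size=5000)   # statistic over the real dataset
good_batch = rng.normal(0.0, 1.0, size=256)   # minibatch from a well-trained generator
bad_batch = rng.normal(1.5, 1.0, size=256)    # minibatch from a generator with a shifted statistic

grid = np.linspace(-6.0, 8.0, 400)

# Representation of the true distribution, computed once from the real data.
p_true = gaussian_kde_on_grid(real_stat, grid, sigma=0.2)

# Representations of the generated distribution, re-estimated from each minibatch.
p_good = gaussian_kde_on_grid(good_batch, grid, sigma=0.3)
p_bad = gaussian_kde_on_grid(bad_batch, grid, sigma=0.3)

loss_good = kl_divergence(p_true, p_good, grid)
loss_bad = kl_divergence(p_true, p_bad, grid)
print(f"matching statistic: {loss_good:.4f}, shifted statistic: {loss_bad:.4f}")
```

Because the divergence compares whole density curves rather than single moments, a mismatch in the shape of the distribution is penalized as well as a mismatch in its mean. In a generator loss, this term would be scaled by a weighting coefficient and added to the usual adversarial loss.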

Paper Structure

This paper contains 23 sections, 16 equations, 15 figures, 12 tables, 3 algorithms.

Figures (15)

  • Figure 1: The various representations involved in matching the statistic $z_s$. The histogram in the background shows the true data distribution. Left: Representation of the true data distribution. Middle left: Representation of the generated data distribution with batch size 64 for different choices of $\sigma$ in the kernel. Middle right: Representation of the generated data distribution for various batch sizes with the optimal choice of $\sigma$ (as determined via Algorithm \ref{alg:fsig} in the appendix). Right: Taking the recent minibatch history into account (here with $\epsilon = 0.9$) can smooth out fluctuations and lead to a more accurate representation. In this figure, a perfectly trained generator has been assumed, i.e., the minibatches have been sampled from real data.
  • Figure 2: (Synthetic example) The distributions of three different power spectrum components $\text{ps}$ as obtained by the different models, where the orange lines show the true distribution as obtained via KDE (Eq. \ref{eq:ptrue}). From left to right, the histograms correspond to the real data, the pcGAN, the method of Wu2020, WGAN, WGAN-GP, and SNGAN. The histograms are based on $20\,000$ generated samples (or on the full dataset, in the case of the real distribution). Parameters for the pcGAN: $\text{bs}=256$, $\lambda=500$, $\epsilon=0.9$, $h=\text{KL}$.
  • Figure 3: (Synthetic example) Different values of the weighting coefficient $\lambda$ are considered (with $\text{bs}=256$, $\epsilon=0.9$, $h=\text{KL}$). Ten runs have been conducted per model, and the mean values plus-or-minus one standard deviation are depicted.
  • Figure 4: (Synthetic example) Different batch sizes with and without historical averaging are considered (with $\lambda=500$, $h=\text{KL}$). The different colors indicate which points belong to the same batch size. Ten runs have been conducted per model, and the mean values plus-or-minus one standard deviation are depicted.
  • Figure 5: (IceCube-Gen2) The distributions of minimum and maximum values as obtained by the different models, where the orange lines show the true distribution as obtained via KDE (Eq. \ref{eq:ptrue}). From left to right, the histograms correspond to the real data, the pcGAN, the method of Wu2020, WGAN, WGAN-GP, and SNGAN. The histograms are based on $20\,000$ generated samples (or on the full dataset, in the case of the real distribution). Parameters for the pcGAN: $\text{bs}=256$, $\lambda=2$, $\epsilon=0.9$, $h=\text{KL}$.
  • ...and 10 more figures
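
The effect described in the rightmost panel of Figure 1, where taking the recent minibatch history into account smooths out fluctuations, can be sketched as follows. The update rule is an assumption: the caption gives only $\epsilon = 0.9$, so an exponential moving average of the per-batch density estimates is used here as one plausible reading. The toy data, grid, and kernel width are likewise hypothetical, and, as in the figure, the minibatches are sampled from the true distribution.

```python
import numpy as np

def gaussian_kde_on_grid(values, grid, sigma):
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    diffs = (grid[:, None] - values[None, :]) / sigma
    return (np.exp(-0.5 * diffs**2) / (sigma * np.sqrt(2.0 * np.pi))).mean(axis=1)

rng = np.random.default_rng(1)
grid = np.linspace(-5.0, 5.0, 201)
dx = grid[1] - grid[0]

# Toy "true" density (assumed): standard normal, evaluated analytically.
reference = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)

epsilon = 0.9      # history weight, value taken from the Figure 1 caption
batch_size = 64    # small batch, so single-batch estimates are noisy
p_hist = None
single_errors = []

for _ in range(200):
    batch = rng.normal(0.0, 1.0, size=batch_size)
    p_batch = gaussian_kde_on_grid(batch, grid, sigma=0.2)
    # Assumed update rule: exponential moving average over minibatch estimates.
    if p_hist is None:
        p_hist = p_batch
    else:
        p_hist = epsilon * p_hist + (1.0 - epsilon) * p_batch
    # Integrated absolute error of the current single-batch estimate.
    single_errors.append(np.sum(np.abs(p_batch - reference)) * dx)

mean_single_error = float(np.mean(single_errors))
hist_error = float(np.sum(np.abs(p_hist - reference)) * dx)
print(f"single batch (mean): {mean_single_error:.3f}, with history: {hist_error:.3f}")
```

With $\epsilon = 0.9$ the averaged estimate effectively pools information from many recent minibatches, which is why a small batch size can still yield a usable representation of the generated distribution.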