Making Method of Moments Great Again? -- How can GANs learn distributions

Yuanzhi Li, Zehao Dou

TL;DR

The paper analyzes the early training dynamics of Wasserstein GANs, arguing that the discriminator acts as a moment-matcher and that matching a small number of low-order moments suffices to learn a broad class of target distributions, including those generated by two-layer neural networks. Leveraging moment-based identification and tensor decomposition (via the CE-Matrix and a 3-generic condition), it proves that a two-layer learner can recover the ground-truth distribution with polynomial sample complexity and provides an algorithmic pathway to do so. Extensions to ReLU discriminators and higher-order activations are provided, along with experimental evidence showing the necessity of higher-order moments for successful learning. The work suggests a principled route to understanding GANs’ capacity to approximate complex distributions beyond local equilibria, with explicit rates and identifiability guarantees. Overall, it connects moment matching, tensor methods, and Wasserstein training to establish theoretical learnability results for structured generator/discriminator pairs.

Abstract

Generative Adversarial Networks (GANs) are widely used models for learning complex real-world distributions. In GANs, training of the generator usually stops when the discriminator can no longer distinguish the generator's output from the set of training examples. A central question is whether, when training stops, the generated distribution is actually close to the target distribution, and how the training process reaches such configurations efficiently. In this paper, we establish theoretical results towards understanding this generator-discriminator training process. We empirically observe that during the early stage of GAN training, the discriminator tries to force the generator to match the low-degree moments of the generator's output to those of the target distribution. Moreover, we prove that merely by matching these empirical moments over polynomially many training examples, the generator can already learn a notable class of distributions, including those that can be generated by two-layer neural networks.
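
As a rough illustration of what "matching low-degree empirical moments" can mean, the sketch below estimates the moment tensors of degree 1 to 3 from two sample sets and measures their squared $\ell_2$ mismatch. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the cutoff at degree 3, and the quadratic-activation generator form $G(z) = A\,(Wz)^2$ are all hypothetical choices made for this example.

```python
import numpy as np


def two_layer_quadratic_generator(z, W, A):
    """Hypothetical two-layer generator G(z) = A (W z)^2 with an elementwise square.

    z: (n, d) latent samples, W: (width, d), A: (out_dim, width).
    """
    hidden = (z @ W.T) ** 2          # (n, width), quadratic activation
    return hidden @ A.T              # (n, out_dim)


def empirical_moments(x):
    """Empirical moment tensors E[x], E[x (x) x], E[x (x) x (x) x] from rows of x."""
    n = x.shape[0]
    m1 = x.mean(axis=0)
    m2 = np.einsum("ni,nj->ij", x, x) / n
    m3 = np.einsum("ni,nj,nk->ijk", x, x, x) / n
    return m1, m2, m3


def moment_mismatch(x, y):
    """Sum of squared l2 distances between the degree-1..3 moments of x and y."""
    total = 0.0
    for mx, my in zip(empirical_moments(x), empirical_moments(y)):
        total += float(np.sum((mx - my) ** 2))
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, width, out_dim, n = 3, 4, 3, 20000
    W = rng.normal(size=(width, d)) / np.sqrt(d)
    A = rng.normal(size=(out_dim, width)) / np.sqrt(width)

    target = two_layer_quadratic_generator(rng.normal(size=(n, d)), W, A)
    same = two_layer_quadratic_generator(rng.normal(size=(n, d)), W, A)
    other = two_layer_quadratic_generator(rng.normal(size=(n, d)), W, 3.0 * A)

    print("mismatch, same generator:     ", moment_mismatch(target, same))
    print("mismatch, different generator:", moment_mismatch(target, other))
```

In this toy setting, two sample sets drawn from the same generator differ only by sampling noise, while a generator with rescaled output weights shows a much larger mismatch. A moment-matching learner would adjust its weights to drive this quantity down.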

Paper Structure

This paper contains 37 sections, 39 theorems, 216 equations, 4 figures.

Key Result

Theorem 1

Suppose the target distribution $\mathcal{D}^\star$ is generated by $\mathcal{D}^\star = G^\star_{\#}\mathcal{N}(0,I_d)$ for an unknown (poly-weight) two-layer neural network $G^\star$ with polynomial activations. Then, for a learner two-layer neural network $G$ with the same or larger width, for every […] Moreover, such a network $G$ can be found efficiently by a tensor decomposition algorithm.
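
For readers unfamiliar with the notation, the display below spells out the standard meaning of the pushforward $G_{\#}\mathcal{N}(0,I_d)$ and of a degree-$k$ moment tensor. The symbol $M_k$ is introduced here only for illustration and is an assumption, not notation confirmed by the source.

$$
x \sim G_{\#}\mathcal{N}(0, I_d) \iff x = G(z) \text{ with } z \sim \mathcal{N}(0, I_d),
\qquad
M_k(\mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[x^{\otimes k}\right].
$$

Under this reading, "matching moments up to degree $k$" would mean making $M_j(G_{\#}\mathcal{N}(0,I_d))$ close to $M_j(\mathcal{D}^\star)$ for each $j \le k$.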

Figures (4)

  • Figure 1: Illustration of how the generator learns to match the moments between its output and the true distribution on the MNIST data set. X-axis: The average mismatch ($\ell_2$ error) between the moments. Y-axis: The number of training epochs. Top: A successful training run in which the generator indeed learns to match the moments. Bottom: An unsuccessful training run in which the generator gets stuck and fails to match the moments. The two runs use the same training configuration but different random seeds. Purple line: the error reaches $0.05$.
  • Figure 2: Left: Samples generated by a two-layer generator neural network with quadratic activation functions trained on LSUN church data set (right). The discriminator is a three-hidden-layer network with quadratic activation functions. Thus, the low-degree polynomial generator can already learn highly non-trivial images by merely matching the (local) moments up to order 8.
  • Figure 3: The five panels show the Wasserstein loss (top) and the parameter distance (bottom) curves when the discriminator is a quadratic function, a cubic function, a quartic function, a 1-layer ReLU network, and a 2-layer ReLU network, respectively.
  • Figure :

Theorems & Definitions (71)

  • Claim 1: Method of Moment Learning
  • Theorem : Main, Method of Moments, Sketched
  • Theorem : Main; Method of Moment
  • Definition 1: $(\tau,\mathcal{A})$-robustness
  • Theorem 1
  • Remark
  • Definition 2: CE-Matrix
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • ...and 61 more