Double Descent and Other Interpolation Phenomena in GANs

Lorenzo Luzi, Yehuda Dar, Richard Baraniuk

TL;DR

This work addresses how overparameterization affects generalization in GANs, treating the latent dimension as the key source of parameterization. It shows that training GANs by minimizing distribution metrics or $f$-divergences yields the same test error for all interpolating solutions, while a pseudo-supervised scheme, which pairs fabricated latent vectors with real outputs, induces double (and sometimes triple) descent and speeds up training. The authors develop multiple pseudo-supervised formulations, establish theoretical properties of the solution sets, and demonstrate substantial empirical gains in convergence speed and generalization for linear and nonlinear GANs (including MNIST and CelebA). This work highlights a practical pathway to leverage overparameterization in unsupervised generative modeling and motivates further exploration of pseudo-supervision in large-scale settings.

Abstract

We study overparameterization in generative adversarial networks (GANs) that can interpolate the training data. We show that overparameterization can improve generalization performance and accelerate the training process. We study the generalization error as a function of latent space dimension and identify two main behaviors, depending on the learning setting. First, we show that overparameterized generative models that learn distributions by minimizing a metric or $f$-divergence do not exhibit double descent in generalization errors; specifically, all the interpolating solutions achieve the same generalization error. Second, we develop a novel pseudo-supervised learning approach for GANs where the training utilizes pairs of fabricated (noise) inputs in conjunction with real output samples. Our pseudo-supervised setting exhibits double descent (and in some cases, triple descent) of generalization errors. We combine pseudo-supervision with overparameterization (i.e., overly large latent space dimension) to accelerate training while matching or even surpassing generalization performance without pseudo-supervision. While our analysis focuses mostly on linear models, we also apply important insights for improving generalization of nonlinear, multilayer GANs.
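
As a minimal sketch of the pseudo-supervised idea in the linear case (an illustration under stated assumptions, not the authors' implementation): fabricated latent vectors $z_i \sim \mathcal{N}(0, I_k)$ are paired with real samples $x_i$, and a linear generator $\mathbf{G}$ is fit by minimum-norm least squares. The data model, the noise level `sigma`, the helper names, and the least-squares loss are assumptions made here for illustration; the paper's own training losses may differ. The test error is the 2-Wasserstein distance between zero-mean Gaussians, which has a closed form.

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2(A, B):
    """2-Wasserstein distance between N(0, A) and N(0, B)."""
    rA = sqrtm_psd(A)
    cross = sqrtm_psd(rA @ B @ rA)
    val = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return float(np.sqrt(max(val, 0.0)))

def fit_pseudo_supervised(X, k, rng):
    """Pair fabricated latents with real samples and fit G by least squares.

    Hypothetical helper: returns the minimum-norm solution of
    min_G ||X - G Z||_F, where Z holds one fabricated latent per column.
    """
    n = X.shape[1]
    Z = rng.standard_normal((k, n))   # fabricated (noise) inputs
    G = X @ np.linalg.pinv(Z)         # d x k generator matrix
    return G, Z

rng = np.random.default_rng(0)
d, m, n = 64, 10, 20                  # sizes reported for Figure 3
sigma = 0.1                           # assumed noise level (not from the source)

# Assumed data model: noisy samples from an m-dimensional linear subspace.
W = rng.standard_normal((d, m)) / np.sqrt(m)
X = W @ rng.standard_normal((m, n)) + sigma * rng.standard_normal((d, n))
Sigma_true = W @ W.T + sigma**2 * np.eye(d)

errors = {}
for k in (2, 5, 10, 15, 20, 25, 40, 64, 128):
    G, Z = fit_pseudo_supervised(X, k, rng)
    errors[k] = gaussian_w2(G @ G.T, Sigma_true)
    print(f"k={k:4d}  test W2 = {errors[k]:.3f}")
```

Sweeping `k` through values below, at, and above `n` and plotting `errors` is one way to look for the descent behavior described in the abstract; the exact curve depends on the assumed data model and the random seed.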

Paper Structure

This paper contains 24 sections, 12 theorems, 35 equations, 15 figures, 1 table.

Key Result

Corollary 1

There is no double descent in generative models that minimize a metric or $f$-divergence, e.g., PCA for subspace learning, Jensen-Shannon GANs, WGANs, etc.
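
One way to see why interpolating solutions share the same error, sketched for PCA under assumptions not taken from the paper (zero-mean data, a hypothetical noisy linear data model, and reading rank-$k$ PCA as a generator whose covariance $\mathbf{G}\mathbf{G}^\top$ is the top-$k$ part of the sample covariance): once $k \ge n$, the sample covariance has no further nonzero eigenvalues to add, so the learned distribution stops changing and any metric or divergence evaluated on it stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 64, 10, 20      # sizes reported for Figure 3
sigma = 0.1               # assumed noise level (not from the source)

# Assumed data model: noisy samples from an m-dimensional linear subspace.
W = rng.standard_normal((d, m)) / np.sqrt(m)
X = W @ rng.standard_normal((m, n)) + sigma * rng.standard_normal((d, n))

def pca_generator_cov(X, k):
    """Covariance G G^T of a rank-k PCA generator fit to the columns of X."""
    S = X @ X.T / X.shape[1]              # sample covariance, zero mean assumed
    vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    lam = np.clip(vals[top], 0.0, None)   # remove tiny negative round-off
    U = vecs[:, top]
    return (U * lam) @ U.T

S = X @ X.T / n
gaps = {k: float(np.linalg.norm(pca_generator_cov(X, k) - S))
        for k in (5, 10, 20, 40, 64)}
for k, gap in gaps.items():
    print(f"k={k:3d}  ||G G^T - S||_F = {gap:.3e}")
```

In this sketch the gap to the sample covariance is nonzero for $k < n$ and vanishes up to round-off for every $k \ge n$, so the generated distribution, and hence its test error under any fixed metric, is identical across the overparameterized range.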

Figures (15)

  • Figure 1: The test errors of PCA and the linear GAN become constant once the model interpolates, i.e., once the latent dimensionality $k$ reaches the number of training samples $n$. The overparameterized regime therefore exhibits constant error rather than double descent. The test error achieves its minimum when the latent dimensionality $k$ is near the true model's dimensionality $m$. The train errors (left subfigure) and test errors (right subfigure) are calculated with the $2$-Wasserstein metric.
  • Figure 2: The test error of the fully supervised model peaks when the latent dimensionality $k$ equals the number of training samples $n$. The unsupervised model stops changing as soon as it interpolates at $k = n$. The semi-supervised model with $n_\text{sup} = 12$ behaves somewhere in between the other two. For other values of $n_\text{sup}$ and implementation details, see \ref{app:exp on linear stuff}.
  • Figure 3: Evaluation of test error and training convergence speed in learning of linear GANs using the three different training loss formulations in (\ref{eq:training loss - gradient type 1}), (\ref{eq:training loss - gradient type 2}), and (\ref{eq:training loss - gradient type 3}). In the first column of subfigures, we use (\ref{eq:training loss - gradient type 1}) and get double descent that beats the unsupervised baseline in both generalization performance and convergence speed in the overparameterized range of solutions (the baseline corresponds to the case of no pseudo-supervised training samples, $n_\text{ps} = 0$). In the second column of subfigures, we use (\ref{eq:training loss - gradient type 2}) and squash the double descent to get lower generalization error for small latent dimensionality $k$. In the third column of subfigures, we use (\ref{eq:training loss - gradient type 3}) and get triple descent (one peak at $k=n$ and one peak at $k=d$) as well as low generalization errors and extremely fast training speed for large $k$. In these experiments, the true data is $m = 10$-dimensional, the data space is $d = 64$-dimensional, and we have $n=20$ total training data samples. The null estimator ($\mathbf G = {\bm{0}_{d \times k}}$) achieves a test error of approximately $13$, so all of these models perform better for large enough $k$. For additional plots, see \ref{app:exp on linear stuff}.
  • Figure 4: Test errors for multilayer, nonlinear GANs trained on the MNIST digit dataset. On the left we see that the baseline error resembles a noisy version of the test error in \ref{fig:spoon_demo}, characterized by an initial dip and then high levels of error. Our pseudo-supervision training beats the baseline here. As we continue to train (epoch 2052), we see that the baseline error reduces, which may be due to some kind of implicit regularization. On the right, our pseudo-supervised model achieves double descent at epoch 3000. Here the test error is measured by geometry score.
  • Figure 5: These test error heatmaps for multilayer, nonlinear GANs trained on MNIST show that the pseudo-supervised models converge faster than the baseline models. The baseline model has high test error until around epoch 1500, whereas the test error of the pseudo-supervised models drops off at around epoch 750. The baseline model only beats the pseudo-supervised model later in the training (around epoch 2500), when the pseudo-supervised loss increases and admits a double descent shape. The test error is measured by geometry score here. For better visualization, the $k$-axis gives each entry one equally wide column, even though the values are not evenly spaced: $k \in \{1, 2, 4, 6, \dots, 70, 100, 200, 300, \dots, 700\}$.
  • ...and 10 more figures

Theorems & Definitions (21)

  • proof
  • Corollary 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Proposition 1
  • proof
  • Definition 1
  • Theorem 5
  • ...and 11 more