The Cramer Distance as a Solution to Biased Wasserstein Gradients

Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, Rémi Munos

TL;DR

The paper shows that sample gradients of the Wasserstein distance are biased, which can lead SGD to the wrong minimum, and proposes the Cramér distance as an alternative that remains sensitive to the geometry of outcomes. It formalizes three desirable properties of a divergence (sum invariance, scale sensitivity, and unbiased sample gradients), proves that the Cramér distance has all three whereas the Wasserstein metric lacks the third, and gives a practical instantiation, the Cramér GAN. Experiments on ordinal regression and image generation report improved stability, sample diversity, and performance with the Cramér distance. The results suggest that unbiased, geometry-aware divergences are well suited to SGD-based probabilistic and generative modeling.

Abstract

The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. To illustrate the relevance of the Cramér distance in practice we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it performs significantly better than the related Wasserstein GAN.
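
To make the contrast with the Kullback-Leibler divergence concrete, the following minimal sketch computes both distances for discrete distributions on the real line. It assumes the one-dimensional setting, where the 1-Wasserstein metric equals $\int |F_P(x) - F_Q(x)|\,dx$, and it takes the Cramér distance to be the squared $l_2$ distance between cumulative distribution functions, $\int (F_P(x) - F_Q(x))^2\,dx$. The function name `cdf_distances` and the example distributions are illustrative and do not come from the paper.

```python
def cdf_distances(support, p, q):
    """Return (wasserstein_1, cramer) for two pmfs on a shared, sorted support.

    Assumption: the Cramér distance is the squared l_2 distance between CDFs.
    """
    w1 = 0.0
    cramer = 0.0
    cdf_p = 0.0
    cdf_q = 0.0
    # Both CDFs are step functions, constant between consecutive support points.
    for i in range(len(support) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        width = support[i + 1] - support[i]
        gap = cdf_p - cdf_q
        w1 += abs(gap) * width
        cramer += gap * gap * width
    return w1, cramer


support = [0, 1, 10]          # outcome 10 is far from outcomes 0 and 1
target = [1.0, 0.0, 0.0]      # all mass on outcome 0

print(cdf_distances(support, target, [0.0, 1.0, 0.0]))  # (1.0, 1.0)
print(cdf_distances(support, target, [0.0, 0.0, 1.0]))  # (10.0, 10.0)
print(cdf_distances(support, target, [0.5, 0.0, 0.5]))  # (5.0, 2.5)
```

Moving all mass from outcome 0 to outcome 10 costs ten times more than moving it to outcome 1 under both distances, whereas the KL divergence is infinite in both cases because the supports are disjoint.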

Paper Structure

This paper contains 25 sections, 7 theorems, 63 equations, 10 figures.

Key Result

Proposition 1

The KL divergence has unbiased sample gradients (U), but is not scale sensitive (S).
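
The unbiasedness half of Proposition 1 can be checked in a few lines. The sketch below assumes discrete outcomes, i.i.d. samples $X_1, \dots, X_m \sim P$ with empirical distribution $\hat{P}_m$, a model $Q_\theta$ that is differentiable in $\theta$, and that gradient and expectation can be exchanged. This notation is chosen here for illustration and may differ from the paper's.

```latex
\begin{align*}
\mathrm{KL}(\hat{P}_m \,\|\, Q_\theta)
  &= -H(\hat{P}_m) - \frac{1}{m}\sum_{i=1}^{m} \log Q_\theta(X_i), \\
\nabla_\theta \mathrm{KL}(\hat{P}_m \,\|\, Q_\theta)
  &= -\frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log Q_\theta(X_i), \\
\mathbb{E}_{X_{1:m} \sim P}\big[\nabla_\theta \mathrm{KL}(\hat{P}_m \,\|\, Q_\theta)\big]
  &= -\mathbb{E}_{X \sim P}\big[\nabla_\theta \log Q_\theta(X)\big]
   = \nabla_\theta \mathrm{KL}(P \,\|\, Q_\theta).
\end{align*}
```

The entropy term $H(\hat{P}_m)$ does not depend on $\theta$, so the sample gradient is an average of per-sample terms and its expectation equals the gradient of the true loss.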

Figures (10)

  • Figure 1: Leftmost. Target distribution. One outcome ($10$) is significantly more distant than the two others ($0$, $1$). Rest. Distributions minimizing the divergences discussed in this paper, under the constraint $Q(1) = Q(10)$. Both Wasserstein metric and Cramér distance underemphasize $Q(0)$ to better match the cumulative distribution function. The sample Wasserstein loss result is for $m=1$.
  • Figure 2: Left. Wasserstein distance in terms of SGD updates, minimizing the true or sample Wasserstein losses. Also shown are the distances for the KL and Cramér solutions. Results are averaged over $10$ random initializations, with error bands indicating one standard deviation. Center. Ordinal regression on the Year Prediction MSD dataset. Learning curves report RMSE on the test set. Right. The same in terms of sample Wasserstein loss.
  • Figure 3: Generated right halves of the faces for WGAN-GP (left) and Cramér GAN (right). The given left halves are from the CelebA 64x64 validation set (Liu et al., 2015).
  • Figure 4: Approximate Wasserstein distances between the CelebA test set and each generator. $N_u$ indicates the number of critic updates per generator update.
  • Figure 5: Wasserstein loss (black curve) $\theta\mapsto |\theta^*-\theta|$ versus expected sample Wasserstein loss (red curve) $\theta\mapsto \mathbb{E}[|\hat{\theta} - \theta|]$, for different values of $m$ and $\theta^*$, with $p=1$. Left: $m=1$, $\theta^*=0.6$. Stochastic gradient descent using a one-sample Wasserstein gradient estimate will converge to $1$ instead of $\theta^*$. Middle: $m=6$, $\theta^*=0.6$. The minimum of the expected sample Wasserstein loss is the median of $\hat{\theta}$, which here is $\tilde{\theta}=\tfrac{2}{3}\neq\theta^*=0.6$. Right: $m=5$, $\theta^*=0.9$. The minimum of the expected sample Wasserstein loss is $\tilde{\theta}=1$ and not $\theta^*=0.9$.
  • ...and 5 more figures
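
The three panels described in the Figure 5 caption can be reproduced numerically. The sketch below assumes the setting the caption implies: the target is a Bernoulli distribution with parameter $\theta^*$, $\hat{\theta} = k/m$ is the empirical mean of $m$ samples with $k \sim \mathrm{Binomial}(m, \theta^*)$, and the sample loss is $|\hat{\theta} - \theta|$. The function names and the grid resolution are illustrative choices, not taken from the paper.

```python
from math import comb


def expected_sample_loss(theta, m, theta_star):
    """E|theta_hat - theta| with theta_hat = k/m and k ~ Binomial(m, theta_star)."""
    return sum(
        comb(m, k) * theta_star**k * (1 - theta_star) ** (m - k) * abs(k / m - theta)
        for k in range(m + 1)
    )


def minimizer(m, theta_star, grid_size=120):
    """Grid search for the theta in [0, 1] minimizing the expected sample loss.

    A grid of 120 steps contains every k/5 and k/6, so the minimizers for
    m = 5 and m = 6 fall exactly on grid points.
    """
    grid = [i / grid_size for i in range(grid_size + 1)]
    return min(grid, key=lambda t: expected_sample_loss(t, m, theta_star))


for m, theta_star in [(1, 0.6), (6, 0.6), (5, 0.9)]:
    print(m, theta_star, minimizer(m, theta_star))
```

In each case the minimizer is the median of $\hat{\theta}$ rather than $\theta^*$: it is $1$ for $m=1$, $\theta^*=0.6$; $\tfrac{2}{3}$ for $m=6$, $\theta^*=0.6$; and $1$ for $m=5$, $\theta^*=0.9$. The true loss $|\theta^* - \theta|$ is minimized at $\theta^*$ in all three.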

Theorems & Definitions (13)

  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Theorem 2
  • Proof (Propositions 1 and 2)
  • Proof (Theorem 1)
  • Theorem 3
  • proof
  • Lemma 1
  • proof
  • ...and 3 more