Generative Sliced MMD Flows with Riesz Kernels

Johannes Hertrich; Christian Wald; Fabian Altekrüger; Paul Hagemann

TL;DR

The paper addresses the computational bottleneck of maximum mean discrepancy (MMD) in high dimensions by exploiting Riesz kernels, showing that the MMD equals the sliced MMD under these kernels and enabling 1D gradient computations. For the case $r=1$, a sorting-based method reduces gradient evaluation to $O((M+N)\log(M+N))$, and a finite number of projections yields a stochastic gradient estimate with error $O(\sqrt{d/P})$, making large-scale gradient-flow training tractable. The authors formulate Generative MMD Flows using a discretized gradient flow with optional momentum and train a sequence of neural networks to approximate the steps, achieving scalable image generation on standard benchmarks. They also connect sliced MMD to the Wasserstein-1 distance, provide explicit constants, and validate the approach with extensive experiments on MNIST, FashionMNIST, CIFAR10, and CelebA. Overall, the work offers a practical, efficient framework for gradient-flow-based generative modelling via sliced MMD with Riesz kernels and demonstrates strong empirical performance.
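The discretized flow with optional momentum can be illustrated in one dimension, where no slicing is needed. The following is a minimal sketch, not the authors' training code: the heavy-ball style update, the step size `tau`, the momentum value, the rescaling of the gradient by the number of particles, and all function names are assumptions made for illustration. It moves particles `x` towards target samples `y` under the kernel $-|x-y|$ ($r=1$).

```python
import numpy as np

def grad_1d(x, y):
    """Gradient of F(x) = mean_ij |x_i - y_j| - 0.5 * mean_ij |x_i - x_j|
    (kernel -|x - y|, r = 1), computed from sorted arrays."""
    n, m = len(x), len(y)
    xs, ys = np.sort(x), np.sort(y)

    def signed_counts(ref_sorted):
        # (#ref < x_i) - (#ref > x_i) for every particle x_i
        below = np.searchsorted(ref_sorted, x, side="left")
        not_above = np.searchsorted(ref_sorted, x, side="right")
        return below + not_above - len(ref_sorted)

    return signed_counts(ys) / (n * m) - signed_counts(xs) / n**2

def energy_distance_sq(x, y):
    """Naive squared energy distance, used here only to monitor progress."""
    cross = np.abs(x[:, None] - y[None, :]).mean()
    within_x = np.abs(x[:, None] - x[None, :]).mean()
    within_y = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * cross - within_x - within_y

def particle_flow(x0, y, steps=3000, tau=0.02, momentum=0.8):
    """Explicit gradient steps with a velocity buffer (assumed heavy-ball form).
    Setting momentum=0.0 gives plain gradient descent on the particles."""
    x = x0.copy()
    v = np.zeros_like(x)
    n = len(x)
    for _ in range(steps):
        # The factor n makes each particle's velocity O(1) (an assumption).
        v = momentum * v - tau * n * grad_1d(x, y)
        x = x + v
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    start = rng.uniform(0.0, 1.0, size=100)
    target = rng.normal(3.0, 1.0, size=100)
    end = particle_flow(start, target)
    print(energy_distance_sq(start, target), energy_distance_sq(end, target))
```

In the generative setting described above, a sequence of networks would be trained to imitate blocks of such steps; that part is not sketched here.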

Abstract

Maximum mean discrepancy (MMD) flows suffer from high computational costs in large scale computations. In this paper, we show that MMD flows with Riesz kernels $K(x,y) = - \|x-y\|^r$, $r \in (0,2)$ have exceptional properties which allow their efficient computation. We prove that the MMD of Riesz kernels, which is also known as energy distance, coincides with the MMD of their sliced version. As a consequence, the computation of gradients of MMDs can be performed in the one-dimensional setting. Here, for $r=1$, a simple sorting algorithm can be applied to reduce the complexity from $O(MN+N^2)$ to $O((M+N)\log(M+N))$ for two measures with $M$ and $N$ support points. As another interesting follow-up result, the MMD of compactly supported measures can be estimated from above and below by the Wasserstein-1 distance. For the implementations we approximate the gradient of the sliced MMD by using only a finite number $P$ of slices. We show that the resulting error has complexity $O(\sqrt{d/P})$, where $d$ is the data dimension. These results enable us to train generative models by approximating MMD gradient flows by neural networks even for image applications. We demonstrate the efficiency of our model by image generation on MNIST, FashionMNIST and CIFAR10.
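To make the sorting argument for $r=1$ concrete, here is a hedged sketch in Python with NumPy. It assumes the functional $F(x) = \frac{1}{NM}\sum_{i,j}|x_i-y_j| - \frac{1}{2N^2}\sum_{i,j}|x_i-x_j|$ for particles $x_1,\dots,x_N$ and target points $y_1,\dots,y_M$; this normalization and the function names are illustrative choices, and the paper's algorithm may organize the sorting differently. The gradient with respect to $x_i$ only depends on how many points lie below and above $x_i$, which can be read off from sorted arrays by binary search.

```python
import numpy as np

def grad_naive(x, y):
    """Reference gradient via all pairwise signs: O(MN + N^2)."""
    n, m = len(x), len(y)
    interaction = np.sign(x[:, None] - x[None, :]).sum(axis=1) / n**2
    potential = np.sign(x[:, None] - y[None, :]).sum(axis=1) / (n * m)
    return potential - interaction

def grad_sorted(x, y):
    """Same gradient via sorting and binary search: O((M + N) log(M + N))."""
    n, m = len(x), len(y)
    xs, ys = np.sort(x), np.sort(y)

    def signed_counts(ref_sorted):
        # (#ref < x_i) - (#ref > x_i); ties contribute zero, like sign(0).
        below = np.searchsorted(ref_sorted, x, side="left")
        not_above = np.searchsorted(ref_sorted, x, side="right")
        return below + not_above - len(ref_sorted)

    return signed_counts(ys) / (n * m) - signed_counts(xs) / n**2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)
    y = rng.standard_normal(1500) + 1.0
    print(np.max(np.abs(grad_naive(x, y) - grad_sorted(x, y))))
```

The naive version builds $N \times N$ and $N \times M$ arrays, while the sorted version only stores vectors of length $N$ and $M$, which is what makes large particle counts feasible.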

Paper Structure (17 sections, 9 theorems, 67 equations, 10 figures, 2 tables, 3 algorithms)

Introduction
Sliced MMD for Riesz Kernels
Gradients of Sliced MMD
Generative MMD Flows
MMD Particle Flows
Generative MMD Flows
Numerical Examples
Conclusions
Proof of Theorem 1
Proof of Theorem 2
Proof of Theorem 3
Proof of Theorem 4
Comparison of Different Kernels in MMD
Training Algorithm of the Generative Sliced MMD Flow
Ablation Study
...and 2 more sections

Key Result

Theorem 1

Let ${\mathrm k}(x,y) \coloneqq -|x-y|^r$, $r\in(0,2)$. Then, for $\mu, \nu \in \mathcal{P}_r(\mathbb{R}^d)$, it holds $\mathcal{SD}_{\mathrm k}^2(\mu,\nu) = \mathcal{D}_{\mathrm{K}}^2(\mu,\nu)$ with the associated scaled Riesz kernel
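A short calculation indicates where a scaling factor of this kind comes from. It is a sketch under the assumption that the sliced kernel averages $\mathrm k$ over directions $\xi$ drawn uniformly from the unit sphere $\mathbb{S}^{d-1}$; the symbol $c_{d,r}$ is introduced here for illustration, and the paper's own notation and normalization may differ.

```latex
\begin{align*}
\mathbb{E}_{\xi}\big[\mathrm{k}(\langle \xi, x\rangle, \langle \xi, y\rangle)\big]
  &= -\,\mathbb{E}_{\xi}\big[\,|\langle \xi, x-y\rangle|^{r}\big]
   = -\,\|x-y\|^{r}\;\mathbb{E}_{\xi}\big[\,|\xi_1|^{r}\big], \\
\mathbb{E}_{\xi}\big[\,|\xi_1|^{r}\big]
  &= \frac{\Gamma\!\left(\tfrac{d}{2}\right)\Gamma\!\left(\tfrac{r+1}{2}\right)}
          {\sqrt{\pi}\;\Gamma\!\left(\tfrac{d+r}{2}\right)}
   \;=:\; c_{d,r}^{-1}.
\end{align*}
```

The first line uses rotation invariance of the uniform distribution on the sphere, and the second uses that $\xi_1^2$ follows a $\mathrm{Beta}\big(\tfrac12, \tfrac{d-1}{2}\big)$ distribution. As a sanity check, the factor equals $1$ for $d=1$ and $\tfrac12$ for $r=1$, $d=3$.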

Figures (10)

Figure 1: Left: Comparison of run time for $1000$ gradient evaluations of naive MMD and sliced MMD with different numbers of projections $P$ in the case $d=100$. Middle and right: Relative error of the gradients of sliced MMD and MMD with respect to the number $P$ of projections and the dimension $d$. The results show that the relative error behaves asymptotically as $O(\sqrt{d/P})$, as shown in Theorem 4.
Figure 2: Samples and their trajectories from MNIST (left) and CIFAR10 (right) in the MMD flow with momentum (top) and without momentum (bottom), starting in the uniform distribution on $[0,1]^d$, after $2^k$ steps with $k\in\{0,...,16\}$ (for MNIST) and $k\in\{3,...,19\}$ (for CIFAR10). We observe that the MMD flow with momentum converges faster than the MMD flow without momentum.
Figure 3: Generated samples of our generative MMD Flow.
Figure 4: Comparison of the MMD flow with Gaussian kernel (top) and inverse multiquadric kernel (bottom) for different hyperparameters.
Figure 5: Comparison of the MMD flow with Laplacian kernel (top) and Riesz kernel (bottom) for different hyperparameters.
...and 5 more figures
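The relative-error experiment described for Figure 1 can be imitated at small scale. The sketch below is an illustration under stated assumptions, not the authors' benchmark: it fixes $r=1$, draws directions uniformly from the unit sphere, uses a sorting-based 1D gradient, and rescales the sliced estimate by the constant `sphere_constant(d)` $= 1/\mathbb{E}|\xi_1|$ so that it is comparable to the gradient for $K(x,y) = -\|x-y\|$. All function names and the normalization of the functional are hypothetical.

```python
import numpy as np
from math import exp, lgamma, pi, sqrt

def grad_1d(x, y):
    """Sorting-based gradient for the 1D kernel -|x - y|."""
    n, m = len(x), len(y)
    xs, ys = np.sort(x), np.sort(y)

    def signed_counts(ref_sorted):
        below = np.searchsorted(ref_sorted, x, side="left")
        not_above = np.searchsorted(ref_sorted, x, side="right")
        return below + not_above - len(ref_sorted)

    return signed_counts(ys) / (n * m) - signed_counts(xs) / n**2

def grad_sliced(X, Y, num_proj, rng):
    """Monte Carlo gradient of the sliced functional from num_proj directions."""
    xi = rng.standard_normal((num_proj, X.shape[1]))
    xi /= np.linalg.norm(xi, axis=1, keepdims=True)  # uniform on the sphere
    G = np.zeros_like(X)
    for direction in xi:
        # Chain rule: the 1D gradient is mapped back along the direction.
        G += np.outer(grad_1d(X @ direction, Y @ direction), direction)
    return G / num_proj

def grad_full(X, Y):
    """Naive O(N^2 + NM) gradient for the unsliced kernel -||x - y||."""
    def mean_unit(A, B):
        diff = A[:, None, :] - B[None, :, :]
        norm = np.linalg.norm(diff, axis=2, keepdims=True)
        unit = np.divide(diff, norm, out=np.zeros_like(diff), where=norm > 0)
        return unit.mean(axis=1)
    return (mean_unit(X, Y) - mean_unit(X, X)) / len(X)

def sphere_constant(d):
    """1 / E|xi_1| for xi uniform on the unit sphere in R^d (case r = 1)."""
    return sqrt(pi) * exp(lgamma((d + 1) / 2) - lgamma(d / 2))

def relative_error(X, Y, num_proj, rng):
    exact = grad_full(X, Y)
    approx = sphere_constant(X.shape[1]) * grad_sliced(X, Y, num_proj, rng)
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    Y = rng.standard_normal((80, 10)) + 1.0
    for P in (10, 100, 1000):
        print(P, relative_error(X, Y, P, rng))
```

Each projection costs one sort per point set, so the total cost grows linearly in $P$, while the error is expected to shrink roughly like $\sqrt{d/P}$.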

Theorems & Definitions (17)

Theorem 1: Sliced Riesz Kernels are Riesz Kernels
Theorem 2: Relation between $\mathcal{D}_K$ and $\mathcal{W}_1$ for Distance Kernels
Theorem 3: Derivatives of Interaction and Potential Energy
Theorem 4: Error Bound for Stochastic MMD Gradients
Remark 5: Computational Complexity of Gradient Evaluations
Remark 6: Iterative Training and Sampling
Remark 7: Extension to $\mathcal{P}_{\frac{r}{2}}(\mathbb{R}^d)$
Lemma 8
Lemma 9
proof
...and 7 more
