Weighted quantization using MMD: From mean field to mean shift via gradient flows
Ayoub Belhadji, Daniel Sharp, Youssef Marzouk
TL;DR
This work addresses the problem of approximating a target distribution $\pi$ by a weighted $M$-point Dirac mixture chosen to minimize the maximum mean discrepancy (MMD). It introduces a Wasserstein-Fisher-Rao (WFR) gradient flow and its discretization as an interacting particle system (IPS), along with a fixed-point scheme called mean shift interacting particles (MSIP) that extends mean shift and can be interpreted as preconditioned gradient descent for MMD minimization. By unifying gradient flows, mean shift, and kernel-based quantization, the authors derive robust, scalable algorithms that perform well in high-dimensional and multi-modal settings, as demonstrated on Gaussian mixtures and MNIST. MSIP and WFR-IPS are reported to be more robust to initialization than existing methods and to deliver near-optimal MMD quantizations, with potential implications for kernel quadrature and mode-seeking in complex distributions.
Abstract
Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a weighted mixture of Dirac measures that best approximates the target distribution. While much existing work relies on the Wasserstein distance to quantify approximation errors, maximum mean discrepancy (MMD) has received comparatively less attention, especially when allowing for variable particle weights. We argue that a Wasserstein-Fisher-Rao gradient flow is well-suited for designing quantizations optimal under MMD. We show that a system of interacting particles satisfying a set of ODEs discretizes this flow. We further derive a new fixed-point algorithm called mean shift interacting particles (MSIP). We show that MSIP extends the classical mean shift algorithm, widely used for identifying modes in kernel density estimators. Moreover, we show that MSIP can be interpreted as preconditioned gradient descent and that it acts as a relaxation of Lloyd's algorithm for clustering. Our unification of gradient flows, mean shift, and MMD-optimal quantization yields algorithms that are more robust than state-of-the-art methods, as demonstrated via high-dimensional and multi-modal numerical experiments.
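To make the objective concrete, the following is a minimal sketch of the squared MMD between a weighted Dirac mixture $\sum_i w_i \delta_{y_i}$ and a target represented by samples. It is not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the empirical approximation of $\pi$, and the function names `gaussian_kernel`, `mmd_squared`, and `quadrature_weights` are all assumptions made for illustration. The weight rule shown is the standard unconstrained minimizer of the MMD for fixed particle locations (a linear solve against the kernel matrix); the abstract does not state whether the paper uses exactly this rule.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix between rows of x and rows of y (assumed kernel)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(particles, weights, target_samples, bandwidth=1.0):
    """Squared MMD between sum_i w_i delta_{y_i} and the empirical target measure."""
    k_yy = gaussian_kernel(particles, particles, bandwidth)
    k_yx = gaussian_kernel(particles, target_samples, bandwidth)
    k_xx = gaussian_kernel(target_samples, target_samples, bandwidth)
    term_particles = weights @ k_yy @ weights    # double sum over w_i w_j k(y_i, y_j)
    term_cross = weights @ k_yx.mean(axis=1)     # sum over w_i * mean_j k(y_i, x_j)
    term_target = k_xx.mean()                    # mean over k(x_j, x_l)
    return term_particles - 2.0 * term_cross + term_target

def quadrature_weights(particles, target_samples, bandwidth=1.0, jitter=1e-10):
    """Unconstrained MMD-minimizing weights for fixed particles (illustrative).

    Solves K(Y, Y) w = v with v_i = mean_j k(y_i, x_j); a small jitter
    (an assumption, for numerical stability) is added to the diagonal.
    """
    k_yy = gaussian_kernel(particles, particles, bandwidth)
    v = gaussian_kernel(particles, target_samples, bandwidth).mean(axis=1)
    return np.linalg.solve(k_yy + jitter * np.eye(len(particles)), v)

# Usage: compare uniform weights with the solved weights on toy data.
rng = np.random.default_rng(0)
target = rng.normal(size=(200, 2))      # samples standing in for pi
particles = rng.normal(size=(5, 2))     # M = 5 particle locations
uniform = np.full(5, 1.0 / 5)
solved = quadrature_weights(particles, target)
print(mmd_squared(particles, uniform, target))
print(mmd_squared(particles, solved, target))
```

With the particle locations held fixed, the solved weights cannot do worse than uniform weights in MMD, up to the effect of the jitter. Note that these weights are not constrained to be nonnegative or to sum to one.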
