BalLOT: Balanced $k$-means clustering with optimal transport

Wenyan Luo, Dustin G. Mixon

TL;DR

The paper tackles balanced $k$-means clustering by formulating the assignment step as a balanced optimal-transport problem (BalLOT), pairing OT with centroid updates to achieve scalable, balanced partitions. It introduces an entropically regularized variant (E-BalLOT) for speed, and proves that BalLOT yields integral couplings for generic data, with a benign population landscape that aligns local minima with planted clusters. Finite-sample guarantees include a basin-of-attraction analysis and one-step recovery under suitable initializations, plus probabilistic misclustering bounds under the stochastic ball model. Empirically, BalLOT and E-BalLOT show near-linear scaling and competitive exact recovery against SDP and matching-based methods, validating their practical utility for large-scale, balanced clustering tasks.
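As a rough illustration of the alternating scheme described above, the following Python sketch solves each assignment step as a linear program over couplings with uniform marginals, then recomputes each centroid as a coupling-weighted mean. Everything named here is an assumption made for illustration: the function names `balanced_ot_assignment` and `ballot_sketch`, the row-per-point data layout, the marginals $1/n$ and $1/k$, and the use of SciPy's generic `linprog` solver. It is not claimed to be the solver used in the paper, and the dense constraint matrix would not scale to the problem sizes reported in the experiments.

```python
import numpy as np
from scipy.optimize import linprog


def balanced_ot_assignment(X, mu):
    """Solve the balanced assignment step as a transport LP (illustrative sketch).

    X has shape (n, d) with one point per row; mu has shape (k, d).
    Returns a coupling F of shape (n, k) with row sums 1/n and column sums 1/k.
    """
    n, k = X.shape[0], mu.shape[0]
    # Squared Euclidean cost between every point and every centroid.
    cost = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # F is flattened row-major, so entry (i, j) sits at index i * k + j.
    row_sums = np.kron(np.eye(n), np.ones((1, k)))  # one constraint per point
    col_sums = np.kron(np.ones((1, n)), np.eye(k))  # one constraint per cluster
    A_eq = np.vstack([row_sums, col_sums])
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(k, 1.0 / k)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n, k)


def ballot_sketch(X, mu0, max_iter=50, tol=1e-10):
    """Alternate between the transport step and the centroid update."""
    mu = np.array(mu0, dtype=float)
    for _ in range(max_iter):
        F = balanced_ot_assignment(X, mu)
        # Each centroid becomes the F-weighted mean of the data.
        new_mu = (F.T @ X) / F.sum(axis=0)[:, None]
        converged = np.linalg.norm(new_mu - mu) <= tol
        mu = new_mu
        if converged:
            break
    return F, mu
```

When the returned coupling is integral (each row has a single nonzero entry), cluster labels can be read off with `F.argmax(axis=1)`.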

Abstract

We consider the fundamental problem of balanced $k$-means clustering. In particular, we introduce an optimal transport approach to alternating minimization called BalLOT, and we show that it delivers a fast and effective solution to this problem. We establish this with a variety of numerical experiments before proving several theoretical guarantees. First, we prove that for generic data, BalLOT produces integral couplings at each step. Next, we perform a landscape analysis to provide theoretical guarantees for both exact and partial recoveries of planted clusters under the stochastic ball model. Finally, we propose initialization schemes that achieve one-step recovery of planted clusters.
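The entropically regularized variant mentioned in the TL;DR can be illustrated by swapping the exact transport step for Sinkhorn iterations. The sketch below is a generic log-domain Sinkhorn solver paired with the same centroid update; it is not taken from the paper. The function names, the regularization strength `eps`, the fixed iteration counts, and the uniform marginals $1/n$ and $1/k$ are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.special import logsumexp


def sinkhorn_coupling(cost, eps, n_iter=200):
    """Entropic OT with uniform marginals via log-domain Sinkhorn (illustrative)."""
    n, k = cost.shape
    log_a = np.full(n, -np.log(n))  # row marginals 1/n
    log_b = np.full(k, -np.log(k))  # column marginals 1/k
    f = np.zeros(n)
    g = np.zeros(k)
    for _ in range(n_iter):
        # Alternately rescale so that rows, then columns, match their marginals.
        f = eps * (log_a - logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)


def e_ballot_sketch(X, mu0, eps=0.5, n_outer=20):
    """Alternate between a Sinkhorn assignment step and the centroid update."""
    mu = np.array(mu0, dtype=float)
    for _ in range(n_outer):
        cost = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        F = sinkhorn_coupling(cost, eps)
        # Centroids are coupling-weighted means, as in the unregularized case.
        mu = (F.T @ X) / F.sum(axis=0)[:, None]
    return F, mu
```

The coupling returned here is strictly positive rather than integral, so reading labels off with `F.argmax(axis=1)` is a rounding heuristic and need not give exactly balanced clusters in general.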

Paper Structure

This paper contains 13 sections, 10 theorems, 53 equations, 9 figures.

Key Result

Theorem 5

For generic data $\bm{X}$, if the columns of the initialization $\bm{\mu}^0$ are distinct columns of $\bm{X}$ (or if they are generic members of $\mathbb{R}^d$), then for each BalLOT iteration $t=0,1,\ldots$, the minimizer of $f(\bm{F},\bm{\mu}^t)$ subject to $\bm{F}\in \mathcal{U}_{n,k}$ is unique.
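The abstract pairs this result with the claim that BalLOT produces integral couplings for generic data. A small numerical check can make that concrete, under assumptions this summary does not spell out: that $f(\bm{F},\bm{\mu})$ is the transport cost with squared Euclidean distances, that $\mathcal{U}_{n,k}$ is the set of couplings with uniform marginals, and that $k$ divides $n$. The sketch works with the rescaled coupling $n\bm{F}$ (row sums $1$, column sums $n/k$) and tests whether every entry is $0$ or $1$. The helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def scaled_coupling(X, mu):
    """Solve the assignment LP for G = n * F (rows sum to 1, columns to n / k)."""
    n, k = X.shape[0], mu.shape[0]
    cost = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    A_eq = np.vstack([
        np.kron(np.eye(n), np.ones((1, k))),  # row-sum constraints
        np.kron(np.ones((1, n)), np.eye(k)),  # column-sum constraints
    ])
    b_eq = np.concatenate([np.ones(n), np.full(k, n / k)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n, k)


def is_integral(G, atol=1e-6):
    """True when every entry of G is numerically an integer."""
    return bool(np.allclose(G, np.round(G), atol=atol))


# Random Gaussian data stands in for "generic" data; the initial centroids are
# distinct data points, matching the first hypothesis of the theorem.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))
G = scaled_coupling(X, X[:3])
print(is_integral(G), G.sum(axis=0))
```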

Figures (9)

  • Figure 1: For $100$ data points in $\mathbb{R}^2$ drawn from the balanced stochastic ball model with two clusters, and for various clustering algorithms, we plot the rate at which the planted clustering is exactly recovered as a function of the separation parameter $\Delta$. In this experiment, the Hungarian, Matchpair, and BalLOT approaches perform identically, so the light blue and red curves are covered by the orange curve. See Experiment \ref{exp.exact recovery rate vs Delta} for details.
  • Figure 2: For each $n\in \{2^2, 2^3,\dotsc , 2^{27} \}$, we draw $n$ data points in $\mathbb{R}^2$ from the balanced stochastic ball model with two clusters, and we plot the median runtime for different clustering algorithms. BalLOT, E-BalLOT, and Lloyd's algorithm all exhibit near-linear runtimes, while the others are super-linear.
  • Figure 3: Estimating a balanced Gaussian mixture model with cluster centroids. We ran the $k$-means++ initialization as a seed for BalLOT, E-BalLOT, and Lloyd's algorithm. For one run of this experiment, the resulting cluster centroids are displayed on the left, along with line segments that illustrate an optimal-transport correspondence with the ground truth means. On the right, we display box plots of the $2$-Wasserstein distances that result from several trials. See Experiment \ref{exper.gmm} for details.
  • Figure 4: Probability of BalLOT exactly recovering the planted clustering of a balanced stochastic ball model. White denotes probability $1$, and black probability $0$. The $k=2$ case is given on the left, while the $k=3$ case is on the right. For comparison, we plot the threshold given in Theorem \ref{thm:informal_basin_of_attraction}. See Experiments \ref{exper.k=2} and \ref{exper.k=3} for details.
  • Figure 5: Given data drawn from a balanced stochastic ball model with $k=2$, we run BalLOT and plot the misclustering rate in the first step. For comparison, we plot the $n\to\infty$ version of the threshold given in Theorem \ref{thm.k=2 misclustering rate}(b). See Experiment \ref{exper.probabilistic log decaying rate} for details.
  • ...and 4 more figures
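The Figure 3 caption compares estimated centroids with ground truth means through a $2$-Wasserstein distance and an optimal-transport correspondence. One way such a quantity could be computed is sketched below, assuming both sets contain $k$ points with uniform weight $1/k$ (a natural reading for a balanced mixture, though the caption does not state it). In that case an optimal coupling can be taken to be a permutation, so SciPy's `linear_sum_assignment` suffices. The function name `centroid_w2` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def centroid_w2(estimated, truth):
    """2-Wasserstein distance between two uniformly weighted k-point sets.

    Returns the distance and, for each estimated centroid, the index of the
    ground-truth mean it is matched to.
    """
    cost = ((estimated[:, None, :] - truth[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return float(np.sqrt(cost[rows, cols].mean())), cols


truth = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
# Estimated centroids: a relabeled copy of the truth, shifted by (0.3, 0.4).
estimated = truth[[2, 0, 1]] + np.array([0.3, 0.4])
distance, match = centroid_w2(estimated, truth)
print(distance, match)
```

Because the estimate here is a pure translation of the truth by a vector of length $0.5$, the distance returned is $0.5$ up to floating-point rounding, and the matching undoes the relabeling.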

Theorems & Definitions (19)

  • Definition 1: stochastic ball model
  • Theorem 5
  • Theorem 6
  • Definition 7
  • Theorem 8
  • Theorem 11
  • Theorem 12
  • Theorem 15
  • Theorem 17
  • proof
  • ...and 9 more
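Definition 1 is not reproduced in this summary, so the following sampler is only a guess at a common form of the stochastic ball model: each cluster consists of points drawn uniformly from a unit ball around its center, with the same number of points per cluster in the balanced case. The uniform radial law, the unit radius, and the function name are assumptions; the paper's definition may allow other rotation-invariant distributions.

```python
import numpy as np


def sample_stochastic_ball_model(centers, points_per_cluster, rng):
    """Draw balanced clusters, uniform on unit balls around the given centers."""
    k, d = centers.shape
    # Uniform direction: normalize a standard Gaussian vector.
    g = rng.normal(size=(k, points_per_cluster, d))
    directions = g / np.linalg.norm(g, axis=2, keepdims=True)
    # Uniform-in-ball radius: U^(1/d) for U uniform on [0, 1).
    radii = rng.uniform(size=(k, points_per_cluster, 1)) ** (1.0 / d)
    X = (centers[:, None, :] + radii * directions).reshape(-1, d)
    labels = np.repeat(np.arange(k), points_per_cluster)
    return X, labels


# Two clusters in the plane whose centers are separated by Delta = 3.
delta = 3.0
centers = np.array([[-delta / 2, 0.0], [delta / 2, 0.0]])
X, labels = sample_stochastic_ball_model(centers, 50, np.random.default_rng(0))
```

With unit balls, any separation $\Delta > 2$ between centers makes the supports disjoint, which is the regime where exact recovery of the planted clustering is a meaningful target.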