Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming

Yubo Zhuang, Xiaohui Chen, Yun Yang, Richard Y. Zhang

TL;DR

This work addresses the computational bottleneck of SDP-based K-means clustering by introducing a nonnegative low-rank SDP (NLR) that factorizes the SDP variable as $Z=UU^T$ with $U\ge 0$ and solves the resulting nonconvex problem with a primal-dual augmented Lagrangian method (ALM) based on gradient descent. The method retains the strong, information-theoretic exact-recovery guarantees of the SDP under Gaussian mixtures while achieving scalability comparable to NMF, because the low-rank parameterization reduces the number of variables to $O(nr)$. The authors prove local linear convergence of the projected-gradient updates to the SDP solution in the exact-recovery regime: a two-phase analysis (Phase 1 and Phase 2) yields contraction at a rate $\gamma=1-O(K^{-6})$ and total complexity $O(K^6 n r)$ under suitable tuning. Empirically, NLR matches SDP in mis-clustering performance and outperforms NMF, spectral clustering, and K-means variants on large-scale datasets, while scaling to hundreds of thousands of points and remaining robust beyond Gaussian assumptions.
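
For orientation, the relaxation and its factorized restriction can be written out as follows. This is a sketch that assumes the standard Peng-Wei form of the $K$-means SDP; the symbols $A$ (the $n\times n$ Gram matrix of the data points) and $\mathbf{1}_n$ (the all-ones vector) are notation introduced here and may differ from the paper's.

$$
\max_{Z\in\mathbb{R}^{n\times n}} \ \langle A, Z\rangle
\quad\text{s.t.}\quad Z\succeq 0,\quad \operatorname{tr}(Z)=K,\quad Z\mathbf{1}_n=\mathbf{1}_n,\quad Z\ge 0 .
$$

Substituting $Z=UU^T$ with $U\in\mathbb{R}^{n\times r}$ and $U\ge 0$ makes $Z\succeq 0$ and $Z\ge 0$ hold automatically, so only the trace and row-sum constraints remain:

$$
\min_{U\in\mathbb{R}^{n\times r}} \ \langle -A, UU^T\rangle
\quad\text{s.t.}\quad U\ge 0,\quad \|U\|_F^2=K,\quad UU^T\mathbf{1}_n=\mathbf{1}_n .
$$

The unknown shrinks from the $n^2$ entries of $Z$ to the $nr$ entries of $U$, which is the source of the $O(nr)$ variable count quoted above.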

Abstract

$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Recently, semidefinite programming (SDP) relaxations have been proposed for solving the $K$-means optimization problem, which enjoy strong statistical optimality guarantees. However, the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. In contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm widely used by machine learning practitioners, but it lacks a solid statistical underpinning and theoretical guarantees. In this paper, we consider an NMF-like algorithm that solves a nonnegative low-rank restriction of the SDP-relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is as simple and scalable as state-of-the-art NMF algorithms while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves significantly smaller mis-clustering errors compared to the existing state-of-the-art while maintaining scalability.
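
The overall shape of such a primal-dual scheme can be illustrated with a minimal NumPy sketch. It assumes the row-sum constraint $UU^T\mathbf{1}=\mathbf{1}$ is handled by an augmented Lagrangian with multipliers `y` and penalty `beta`, while $U\ge 0$ and $\|U\|_F^2=K$ are enforced by projection. The function names, default step size, iteration counts, and random initialization are all hypothetical choices made for illustration, not the authors' implementation or tuning.

```python
import numpy as np

def project_omega(V, K):
    """Map V onto {U >= 0, ||U||_F^2 = K}: clip negatives, then rescale."""
    P = np.maximum(V, 0.0)
    norm = np.linalg.norm(P)
    if norm == 0.0:
        # Fallback (not the exact projection): use a uniform feasible point.
        P = np.ones_like(V)
        norm = np.linalg.norm(P)
    return np.sqrt(K) * P / norm

def constraint(U):
    """Row-sum residual g(U) = U U^T 1 - 1, computed without forming U U^T."""
    return U @ U.sum(axis=0) - 1.0

def aug_lagrangian(U, y, X, beta):
    """L_beta(U, y) = <-A, U U^T> + <y, g> + (beta/2) ||g||^2 with A = X X^T."""
    g = constraint(U)
    fit = -np.sum((X.T @ U) ** 2)  # <A, U U^T> = ||X^T U||_F^2
    return fit + y @ g + 0.5 * beta * (g @ g)

def aug_lagrangian_grad(U, y, X, beta):
    """Gradient of L_beta with respect to U."""
    g = constraint(U)
    v = y + beta * g               # shifted multipliers
    s = U.sum(axis=0)              # U^T 1
    ones = np.ones(U.shape[0])
    return -2.0 * X @ (X.T @ U) + np.outer(v, s) + np.outer(ones, v @ U)

def nlr_kmeans(X, K, r=None, beta=1.0, step=1e-3, inner=50, outer=30, seed=0):
    """Sketch: projected gradient descent on U, then a dual ascent step on y.

    X has one data point per row (shape n x d). All defaults are illustrative.
    """
    n = X.shape[0]
    r = K if r is None else r
    rng = np.random.default_rng(seed)
    U = project_omega(rng.random((n, r)), K)
    y = np.zeros(n)
    for _ in range(outer):
        for _ in range(inner):
            grad = aug_lagrangian_grad(U, y, X, beta)
            U = project_omega(U - step * grad, K)
        y = y + beta * constraint(U)  # dual update on the row-sum constraint
    return U, y
```

Each gradient evaluation touches only $n\times r$ and $n\times d$ arrays, so the $n\times n$ matrix $Z=UU^T$ is never formed. In practice the step size and penalty would need tuning to the scale of the data.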

Paper Structure (18 sections, 22 theorems, 181 equations, 6 figures, 2 tables, 1 algorithm)


Key Result

Proposition 1

Let $U^*$ denote a local minimizing point that satisfies the second-order sufficient conditions for an isolated local minimum with respect to multipliers $y^*$. Then, there exists a scalar $\beta^*\ge 0$ such that, for every $\beta > \beta^*$ the augmented Lagrangian $\mathcal{L}_\beta(U,y)$ has a u

Figures (6)

  • Figure 1: Log-scale trade-off plot of CPU time versus mis-clustering error, under an increasing number of data points $n$. Here, NLR corresponds to our proposed non-negative low-rank factorization method, KM corresponds to $K$-means++ [arthur2007k], SC corresponds to spectral clustering [NgJordanWeiss2001_NIPS], SDP [PengWei2007_SIAMJOPTIM] uses the SDPNAL+ solver [YangSunToh2015_SDPNAL+], and NMF corresponds to non-negative factorization [ding2005equivalence]. We follow the same setting as the first experiment in Section \ref{sec:num_exp}, where the theoretically optimal mis-clustering error decays to zero as $n$ increases to infinity.
  • Figure 2: Log-scale plots with error bars of mis-clustering error (leftmost) and time cost (in the middle) as sample size $n$ increases and the convergence of NLR over iterations (rightmost). The plots are partial for SC (SDP) due to their huge space (time) complexity when the sample size is large.
  • Figure 3: Boxplots of mis-clustering error (with means) for CyTOF dataset (on the left) and CIFAR-10 (on the right) among five different methods. The plots are partial for SC (SDP) due to their huge space (time) complexity when the sample size is large.
  • Figure 4: Log-scale plots of the convergence of NLR algorithms over iterations per dual update. Left: the NLR algorithm where the Lagrangian function contains all constraints and the minimum of the augmented Lagrangian function is solved based on limited memory BFGS. Right: Our algorithm where the minimum of the augmented Lagrangian function is solved based on projected GD.
  • Figure 5: QQ-plots for Heart dataset. The first (second) row corresponds to three randomly selected covariates for the first (second) cluster in the Heart dataset.
  • ...and 1 more figure

Theorems & Definitions (22)

  • Proposition 1: Existence and quality of primal minimizer
  • Proposition 2: Linear convergence of dual multipliers
  • Theorem 1: Local convergence of projected gradient descent
  • Theorem 2: Feasible solutions
  • Theorem 3: Convergence of projected gradient descent at $y^\ast$
  • Proposition 3
  • Proposition 4: Theorem II.1 in [chen2021cutoff]
  • Proposition 5
  • Corollary 1
  • Lemma 1: Smoothness condition
  • ...and 12 more