Sum-of-norms clustering does not separate nearby balls
Alexander Dunlap, Jean-Christophe Mourrat
TL;DR
This work analyzes sum-of-norms clustering through a measure-valued, convex-analytic lens, deriving an exact KKT-based characterization of the minimizer and an exact, local-global description of the resulting clusters. It proves a precise split condition: clusters are contained in radius-$\lambda$ balls around their centroids and different clusters have well-separated centroids, with the centroid measure $\mathcal{M}_{u}(\mu)$ being $\lambda$-shattered while each cluster is $\lambda$-cohesive. The stochastic ball model exposes a brittleness in detecting two nearby clusters: for large dimension, there exists a critical threshold $\lambda_c=\lambda_1(\mu)$ below which at least three clusters appear with high probability, and above which a single cluster tends to dominate, with sharp dimension-dependent bounds via $\gamma_d$. Stability results show that the clustering behavior persists under natural perturbations of the data-generating measure, enabling convergence statements for empirical measures and their clusterings. Collectively, the paper illuminates fundamental limits of SON clustering for closely spaced clusters, suggests why weighting the fusion term can restore separability in practice, and provides a rigorous foundation for understanding and improving convex clustering in high-dimensional settings.
Abstract
Sum-of-norms clustering is a popular convexification of $K$-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as $2\sqrt{2}$. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
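To make the setting concrete, the following is a minimal sketch, not the authors' code, of the two objects the abstract refers to: a sample from the uniform measure on the union of two disjoint unit balls, and a sum-of-norms objective evaluated on that sample. The function names `sample_two_balls` and `son_objective` are hypothetical, and the objective uses the common unnormalized form $\frac{1}{2}\sum_i \|x_i - y_i\|^2 + \lambda \sum_{i<j} \|x_i - x_j\|$; the paper may scale the two terms differently, so the value of $\lambda$ here should not be compared directly with thresholds such as $\lambda_1(\mu)$.

```python
import numpy as np


def sample_two_balls(n, dim, center_distance, rng):
    """Draw n points uniformly from the union of two unit balls.

    The centers sit at -center_distance/2 and +center_distance/2 on the
    first coordinate axis (an illustrative choice); the balls are disjoint
    when center_distance > 2. Each point picks a ball with probability 1/2,
    which matches the uniform measure because the balls have equal volume.
    """
    # Uniform point in the unit ball: uniform direction times U^(1/dim).
    gauss = rng.standard_normal((n, dim))
    directions = gauss / np.linalg.norm(gauss, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / dim)
    points = directions * radii[:, None]

    labels = rng.integers(0, 2, size=n)  # 0 or 1: which ball
    points[:, 0] += (labels - 0.5) * center_distance
    return points, labels


def son_objective(x, y, lam):
    """Unnormalized sum-of-norms objective (assumed scaling, see lead-in).

    x : (n, dim) candidate representatives, one per data point
    y : (n, dim) data points
    """
    fidelity = 0.5 * np.sum((x - y) ** 2)
    # Pairwise Euclidean distances between representatives.
    pairwise = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    fusion = lam * np.sum(np.triu(pairwise, k=1))  # each pair counted once
    return fidelity + fusion


rng = np.random.default_rng(0)
y, labels = sample_two_balls(n=200, dim=5, center_distance=2.5, rng=rng)

# Two extreme candidates: no fusion at all, and full fusion to the mean.
no_fusion = son_objective(y, y, lam=0.01)
full_fusion = son_objective(np.tile(y.mean(axis=0), (len(y), 1)), y, lam=0.01)
print(no_fusion, full_fusion)
```

Minimizing this objective over `x` requires a convex solver, which is left out here; the sketch only shows how the fidelity and fusion terms trade off for the two extreme candidates.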
