Sum-of-norms clustering does not separate nearby balls

Alexander Dunlap, Jean-Christophe Mourrat

TL;DR

This work analyzes sum-of-norms clustering through a measure-valued, convex-analytic lens, deriving an exact KKT-based characterization of the minimizer and an exact, local-global description of the resulting clusters. It proves a precise split condition: clusters are contained in radius-$\lambda$ balls around their centroids and different clusters have well-separated centroids, with the centroid measure $\mathcal{M}_{u}(\mu)$ being $\lambda$-shattered while each cluster is $\lambda$-cohesive. The stochastic ball model exposes a brittleness in detecting two nearby clusters: for large dimension, there exists a critical threshold $\lambda_c=\lambda_1(\mu)$ below which at least three clusters appear with high probability, and above which a single cluster tends to dominate, with sharp dimension-dependent bounds via $\gamma_d$. Stability results show that the clustering behavior persists under natural perturbations of the data-generating measure, enabling convergence statements for empirical measures and their clusterings. Collectively, the paper illuminates fundamental limits of SON clustering for closely spaced clusters, suggests why weighting the fusion term can restore separability in practice, and provides a rigorous foundation for understanding and improving convex clustering in high-dimensional settings.

Abstract

Sum-of-norms clustering is a popular convexification of $K$-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as $2\sqrt{2}$. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
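
For orientation, the objective that sum-of-norms clustering minimizes can be sketched in a few lines of Python. The text above does not state the objective, so the normalization used here, $\frac{1}{N}\sum_n |y_n - x_n|^2 + \frac{\lambda}{N^2}\sum_{m,n} |y_m - y_n|$ with the second sum over ordered pairs, is an assumption, and the name `son_objective` is illustrative only.

```python
import numpy as np

def son_objective(x, y, lam):
    """Evaluate a sum-of-norms objective (assumed normalization, see lead-in).

    x   : (N, d) array of datapoints x_n
    y   : (N, d) array of candidate cluster representatives y_n
    lam : fusion strength lambda >= 0
    """
    n = x.shape[0]
    # Fidelity term: mean squared distance between each point and its representative.
    fidelity = np.sum((y - x) ** 2) / n
    # Fusion term: Euclidean norms of all ordered pairwise differences y_m - y_n.
    diffs = y[:, None, :] - y[None, :, :]
    fusion = lam * np.sum(np.linalg.norm(diffs, axis=-1)) / n**2
    return fidelity + fusion

# Two points on the line: keep them apart, or fuse both representatives at the mean.
x = np.array([[0.0], [2.0]])
fused = np.array([[1.0], [1.0]])
print(son_objective(x, x, 3.0), son_objective(x, fused, 3.0))
```

With $\lambda = 3$ the fused configuration has the smaller value, which illustrates how a large fusion strength pushes representatives to coincide. Datapoints whose representatives coincide at the minimizer form one cluster.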

Paper Structure

This paper contains 16 sections, 29 theorems, 195 equations, 4 figures.

Key Result

Theorem 1.1

There exists a $\lambda_{\mathrm{c}} \in (0,\infty)$ such that the following holds. Let $r \in [1,\gamma_d)$, $\mu$ be the uniform probability measure on $B_1(-r\mathrm{e}_1) \cup B_1(r\mathrm{e}_1) \subseteq \mathbf{R}^d$, $(X_n)_{n \in \mathbf{N}}$ be independent random variables with law $\mu$, a… In fact, we can take $\lambda_{\mathrm{c}} = \lambda_1(\mu)$, with the latter quantity defined in e…
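
The measure $\mu$ in Theorem 1.1 can be sampled directly. The following is a minimal sketch in Python with NumPy; the function name `sample_stochastic_ball` and its parameters are illustrative and are not taken from the authors' code. It relies on the observation that, for $r \ge 1$, the two unit balls have equal volume and share at most one point, so the uniform measure on their union is an equal mixture of the uniform measures on each ball.

```python
import numpy as np

def sample_stochastic_ball(n, d, r, rng=None):
    """Draw n i.i.d. points uniformly from B_1(-r e_1) ∪ B_1(r e_1) in R^d (r >= 1)."""
    rng = np.random.default_rng(rng)
    # Uniform direction on the unit sphere: normalize a standard Gaussian vector.
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    # Radius with density proportional to s^(d-1) on [0, 1): take U^(1/d).
    radii = rng.random(n) ** (1.0 / d)
    points = directions * radii[:, None]
    # Choose one of the two balls with probability 1/2 and shift along e_1.
    signs = rng.choice([-1.0, 1.0], size=n)
    points[:, 0] += signs * r
    return points

# Example: the setting of Figure 1.2 (N = 200, d = 2, r = 1.05).
X = sample_stochastic_ball(200, 2, 1.05, rng=0)
print(X.shape)
```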

Figures (4)

  • Figure 1.1: The output of the clustering algorithm on $N=100$ datapoints divided between the boundaries of three balls, for four values of $\lambda$. The filled circles represent the datapoints $x_n$, and the crosses represent the cluster representatives $y_n$. Each color represents a cluster. All figures in this paper were generated using an implementation (by the present authors) of the algorithm described in JV20. The code is available at https://github.com/ajdunlap/son-clustering-experiments.
  • Figure 1.2: Sum-of-norms clustering of the stochastic ball model with $N=200$ datapoints drawn from $B(-1.05\mathrm{e}_1,1)\cup B(1.05\mathrm{e}_1,1)$. The balls from which the points are drawn are outlined in dotted grey lines. When $\lambda =2.0$, there are many clusters, but when $\lambda$ is slightly larger ($\lambda = 2.15$), there is just one large cluster. Theorem 1.1 tells us that (since $1.05<\gamma_2$), in the limit as $N\to\infty$, there will be no open interval of values of $\lambda$ for which there are exactly two clusters.
  • Figure 3.1: Clustering results for the vertices of two octagons. Vertices assigned to the same cluster are drawn in the same color.
  • Figure 3.2: The number of clusters produced by sum-of-norms clustering run on the measure $\mu$ given by the uniform distribution on $\{x\in\delta\mathbf{Z}^2\mid |x|\le 1\}$, for varying choices of $\lambda$ and $\delta$. Missing values correspond to failures to certify the clustering using the procedure of JV20.
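
The support of the measure in Figure 3.2, $\{x\in\delta\mathbf{Z}^2\mid |x|\le 1\}$, is easy to enumerate. The sketch below is only an illustration of that set: the helper name `lattice_disk` and the small floating-point tolerance are assumptions, not part of the paper or its code.

```python
import numpy as np

def lattice_disk(delta, tol=1e-12):
    """Return the points of delta * Z^2 that lie in the closed unit disk, as an (M, 2) array."""
    # Integer coordinates range over [-k, k], where k is the largest integer with k * delta <= 1.
    k = int(np.floor(1.0 / delta + tol))
    idx = np.arange(-k, k + 1)
    gx, gy = np.meshgrid(idx, idx, indexing="ij")
    ints = np.stack([gx.ravel(), gy.ravel()], axis=1)
    # Keep lattice points with Euclidean norm at most 1 (tolerance guards against rounding).
    keep = np.hypot(ints[:, 0], ints[:, 1]) * delta <= 1.0 + tol
    return ints[keep] * delta

# The uniform distribution on this set puts mass 1/M on each of the M returned points.
pts = lattice_disk(0.5)
print(len(pts))
```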

Theorems & Definitions (31)

  • Theorem 1.1
  • Definition 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Definition 1.5
  • Proposition 1.6
  • Theorem 1.7
  • Proposition 1.8
  • Theorem 1.9
  • Theorem 1.10
  • ...and 21 more