Table of Contents
Fetching ...

The elbow statistic: Multiscale clustering statistical significance

Francisco J. Perez-Reche

TL;DR

This work introduces ElbowSig, a framework that formalizes the heuristic ``elbow'' method as a rigorous inferential problem, and derives the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability.

Abstract

Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing criteria typically target a single ``optimal'' partition, often overlooking statistically meaningful structure present at multiple resolutions. We introduce ElbowSig, a framework that formalizes the heuristic ``elbow'' method as a rigorous inferential problem. Our approach centers on a normalized discrete curvature statistic derived from the cluster heterogeneity sequence, which is evaluated against a null distribution of unstructured data. We derive the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability. As an algorithm-agnostic procedure, ElbowSig requires only the heterogeneity sequence and is compatible with a wide range of clustering methods, including hard, fuzzy, and model-based clustering. Extensive experiments on synthetic and empirical datasets demonstrate that the method maintains appropriate Type-I error control while providing the power to resolve multiscale organizational structures that are typically obscured by single-resolution selection criteria.

The elbow statistic: Multiscale clustering statistical significance

TL;DR

This work introduces ElbowSig, a framework that formalizes the heuristic ``elbow'' method as a rigorous inferential problem, and derives the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability.

Abstract

Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing criteria typically target a single ``optimal'' partition, often overlooking statistically meaningful structure present at multiple resolutions. We introduce ElbowSig, a framework that formalizes the heuristic ``elbow'' method as a rigorous inferential problem. Our approach centers on a normalized discrete curvature statistic derived from the cluster heterogeneity sequence, which is evaluated against a null distribution of unstructured data. We derive the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability. As an algorithm-agnostic procedure, ElbowSig requires only the heterogeneity sequence and is compatible with a wide range of clustering methods, including hard, fuzzy, and model-based clustering. Extensive experiments on synthetic and empirical datasets demonstrate that the method maintains appropriate Type-I error control while providing the power to resolve multiscale organizational structures that are typically obscured by single-resolution selection criteria.
Paper Structure (17 sections, 14 equations, 3 figures, 5 tables)

This paper contains 17 sections, 14 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Mean value and variance of the elbow statistic $\delta_k^{(r)}$ for unstructured data as a function of the dimension $D$ for agglomerative, k-means, GMM and FCM clustering methods. The expectation and variance are computed from $N_R = 500$ reference datasets of size $N=30$ uniformly distributed on $[0,1]^D$, assuming $k=3$ clusters. The dashed lines indicate the predicted asymptotic trends for large $D$. The dotted line in panel (a) indicates the asymptotic value of $\mathbb{E}[\delta_k^{(r)}]$ for FCM in the limit of large fuzzifier parameter $m$.
  • Figure 2: Clustering of two-dimensional data. (a) Dataset with $N=200$ observations consisting of $M=3$ Gaussian components centred at the locations indicated by crosses, each with spread $\sigma_c = 1$. Solid circles show data points coloured according to their generating component. Grey crosses show one realization of a bounding-box uniform reference dataset. Partitions with $k \in \{1,2,\dots,10\}$ were evaluated using five standard methods for estimating the number of clusters: (b) Calinski--Harabasz, (c) Davies--Bouldin, and (d) silhouette scores as functions of $k$. Vertical dashed lines indicate the value $\hat{k}$ at which each index attains its maximum. (e) Gap statistic as a function of $k$, with vertical lines marking the optimal number of clusters according to the Gap (I) and Gap (II) criteria described in the text. Error bars indicate the standard error estimate $s_k$tibshirani_estimating_2001.
  • Figure 3: Application of ElbowSig to the two-dimensional dataset shown in Fig. \ref{['fig:2D_Example_Traditional']}. (a) Total within-cluster heterogeneity (inertia) as a function of $k$. (b) Observed elbow statistic (circles) together with the $100 (1 - p_{\text{sig}})$ percentile of the $\delta_k$ distribution obtained from unstructured reference datasets (squares) at level $q_1=0.05$. (c) Empirical p-values $p_k$ comparing the observed elbow statistic at each $k$ with its reference distribution. The horizontal dashed line indicates the per-scale significance threshold $p_{\text{sig}}$.