Strong Consistency of Sparse K-means Clustering

Jeungju Kim, Johan Lim

TL;DR

This work establishes strong consistency for sparse K-means clustering in high dimensions when using the squared Euclidean distance, and risk consistency for general distances, by reformulating the problem in a centroid-based empirical risk minimization framework. The authors define population and empirical risks $R(\mathbf{w},A)$ and $R_n(\mathbf{w},A)$, prove risk and estimator convergence under mild assumptions, and connect the analysis to Rademacher complexity and U-statistics to obtain finite-sample-like guarantees. For Euclidean distance, a concrete risk bound $R(\hat{\theta})-R(\theta^*)$ is derived, leading to almost sure convergence $\hat{\theta} \to \theta^*$ as $n,p \to \infty$ with $\log(p)/n \to 0$; for general distances, analogous risk bounds are established with complexity terms $RC$ and $RC_j$, tied to VC-dimension, ensuring consistency under suitable growth rates. The results extend to models with $l_0$ or $l_1$ penalties and illuminate the limits of recovering cluster structure under different mixture models, while outlining directions for extending clustering-consistency analysis to broader distance notions.

Abstract

In this paper, we study the strong consistency of sparse K-means clustering for high-dimensional data. We prove consistency in both risk and clustering for the Euclidean distance. We discuss the characterization of the limit of the clustering in some special cases. For a general (non-Euclidean) distance, we prove consistency in risk. Our result naturally extends to other models in the recent literature that share the same objective function but use different constraints, such as an $l_0$ or $l_1$ penalty.

Paper Structure

This paper contains 9 sections, 11 theorems, and 71 equations.

Key Result

Lemma 1

The optimal values of the original sparse K-means problem and the centroid-based sparse K-means problem are the same when $d_{i,i',j}=(X_{ij}-X_{i'j})^2$.
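
A standard identity that plausibly underlies Lemma 1 is that, for the squared difference, the pairwise within-cluster dispersion of a feature equals twice the sum of squared deviations from the cluster mean. The sketch below checks this numerically for a fixed partition. The function names, the random data, the cluster labels, and the $1/n_k$ scaling are illustrative assumptions rather than the paper's notation, and the check covers only the fixed-partition identity, not the full optimization over weights and partitions.

```python
import numpy as np

def pairwise_within_cluster(X, labels):
    """Per-feature dispersion in pairwise form (assumed scaling):
    sum_k (1/n_k) * sum_{i,i' in C_k} (X_ij - X_i'j)^2."""
    total = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        diffs = Xk[:, None, :] - Xk[None, :, :]  # shape (n_k, n_k, p)
        total += (diffs ** 2).sum(axis=(0, 1)) / len(Xk)
    return total

def centroid_within_cluster(X, labels):
    """Per-feature dispersion in centroid form:
    2 * sum_k sum_{i in C_k} (X_ij - mean_kj)^2."""
    total = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        total += 2.0 * ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return total

# Hypothetical data: 30 points, 5 features, 3 non-empty clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
labels = np.arange(30) % 3
w = rng.uniform(size=5)  # arbitrary nonnegative feature weights

pairwise = pairwise_within_cluster(X, labels)
centroid = centroid_within_cluster(X, labels)

# The two forms agree feature by feature, hence for any weighted sum.
identity_holds = np.allclose(pairwise, centroid)
weighted_gap = abs(w @ pairwise - w @ centroid)
```

Because the two forms agree for every feature, any weighted objective built from them takes the same value on a given partition, which is consistent with the equality of optimal values stated in the lemma.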

Theorems & Definitions (22)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • proof
  • Theorem 3
  • Corollary 2
  • proof
  • Theorem 4
  • proof
  • ...and 12 more