Strong Consistency of Sparse K-means Clustering

Jeungju Kim; Johan Lim

TL;DR

This work establishes strong consistency for sparse K-means clustering in high dimensions under the squared Euclidean distance, and risk consistency for general distances, by reformulating the problem as centroid-based empirical risk minimization. The authors define population and empirical risks $R(\mathbf{w},A)$ and $R_n(\mathbf{w},A)$, prove convergence of both the risk and the estimator under mild assumptions, and use Rademacher complexity and U-statistics to obtain finite-sample-style bounds. For the Euclidean distance, a concrete bound on the excess risk $R(\hat{\theta})-R(\theta^*)$ is derived, leading to almost sure convergence $\hat{\theta} \to \theta^*$ as $n,p \to \infty$ with $\log(p)/n \to 0$. For general distances, analogous risk bounds are established with complexity terms $RC$ and $RC_j$, tied to the VC-dimension, which ensure consistency under suitable growth rates. The results extend to models with $\ell_0$ or $\ell_1$ penalties, characterize the limit of the clustering under different mixture models, and outline directions for extending clustering-consistency analysis to broader notions of distance.

Abstract

In this paper, we study the strong consistency of sparse K-means clustering for high-dimensional data. We prove consistency in both risk and clustering for the Euclidean distance. We discuss the characterization of the limit of the clustering in some special cases. For a general (non-Euclidean) distance, we prove consistency in risk. Our results extend naturally to other models in the recent literature that share the same objective function but use different constraints, such as an $\ell_0$ or $\ell_1$ penalty.

Paper Structure (9 sections, 11 theorems, 71 equations)

Introduction
Main results
Notations and Assumptions
Consistency for Euclidean distance
Consistency for general distance
Discussion
Appendix: Proofs
Proofs of Euclidean distance
Proofs of general (non-Euclidean) distance

Key Result

Lemma 1

The optimal values of the original sparse K-means problem and the centroid-based sparse K-means problem are the same when $d_{i,i',j}=(X_{ij}-X_{i'j})^2$.
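
The identity behind Lemma 1 can be checked numerically. The following is a minimal sketch, not the paper's formulation: the two objectives referred to in the lemma are not reproduced in this summary, so the sketch assumes the usual pairwise within-cluster term $\sum_k \frac{1}{n_k}\sum_{i,i' \in C_k} d_{i,i',j}$ (sums over ordered pairs) and compares it with its centroid-based counterpart $2\sum_k \sum_{i \in C_k} (X_{ij}-\bar{X}_{kj})^2$. The function names, the weight vector `w`, and the simulated data are illustrative assumptions.

```python
import numpy as np

def pairwise_within(X, labels, w):
    """Weighted within-cluster dispersion computed from pairwise squared differences."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]                       # points in cluster k, shape (n_k, p)
        diff = Xk[:, None, :] - Xk[None, :, :]    # all ordered pairs, shape (n_k, n_k, p)
        d = (diff ** 2).sum(axis=(0, 1))          # per-feature sum of d_{i,i',j}
        total += (w * d).sum() / Xk.shape[0]      # divide by cluster size n_k
    return total

def centroid_within(X, labels, w):
    """The same quantity computed from squared deviations around cluster centroids."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        dev = ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)  # per-feature deviation
        total += 2.0 * (w * dev).sum()                   # factor 2 from ordered pairs
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                  # hypothetical data: n = 30, p = 5
labels = np.arange(30) % 3                    # arbitrary assignment into K = 3 clusters
w = np.array([0.6, 0.8, 0.0, 0.0, 0.0])       # hypothetical sparse feature weights

pw = pairwise_within(X, labels, w)
cw = centroid_within(X, labels, w)
print(np.isclose(pw, cw))
```

Because the two quantities agree for every partition and every weight vector, optimizing over partitions and weights gives the same optimal value in either form, which is what allows the analysis to work with centroids under the squared Euclidean distance.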

Theorems & Definitions (22)

Lemma 1
Theorem 1
Theorem 2
Corollary 1
proof
Theorem 3
Corollary 2
proof
Theorem 4
proof
...and 12 more
