A penalized criterion for selecting the number of clusters for K-medians

Antoine Godichon-Baggioni; Sobihan Surendran

TL;DR

The authors obtain a suitable penalty shape for their criterion for selecting the number of clusters in K-medians and derive an associated oracle-type inequality. In a simulation study, they compare the performance of this approach, run with different types of K-medians algorithms, against other popular techniques.

Abstract

Clustering is a common unsupervised machine learning technique for grouping data points based on similar features. We focus here on unsupervised clustering for contaminated data, i.e., the case where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to treat the choice of the optimal number of clusters as the minimization of a risk function via penalization. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared in a simulation study with other popular techniques. All studied algorithms are available in the R package Kmedians on CRAN.
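To make the idea of choosing the number of clusters by penalized risk minimization concrete, here is a minimal Python sketch. It is not the authors' implementation and does not use the Kmedians R package. It assumes the empirical risk is the mean distance from each point to its nearest center, and that the criterion has the form crit(k) = W_n(c_k) + lam * sqrt(k/n), where sqrt(k/n) is the penalty shape named in the figure captions. The constant `lam`, the helper names, and the farthest-point initialization are all hypothetical choices made for illustration only.

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-9):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iterations)."""
    m = points.mean(axis=0)
    for _ in range(n_iter):
        # Clip distances so a point coinciding with m does not divide by zero.
        dist = np.maximum(np.linalg.norm(points - m, axis=1), eps)
        w = 1.0 / dist
        new_m = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(new_m - m) < eps:
            return new_m
        m = new_m
    return m

def farthest_point_init(X, k):
    """Deterministic initialization (an assumption of this sketch, not from the paper)."""
    centers = [X[0]]
    d = np.linalg.norm(X - X[0], axis=1)
    for _ in range(1, k):
        idx = int(d.argmax())
        centers.append(X[idx])
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centers, dtype=float)

def k_medians(X, k, n_iter=50):
    """Lloyd-style K-medians: assign to nearest center, update by geometric median."""
    centers = farthest_point_init(X, k)
    labels = None
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = geometric_median(members)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Assumed empirical risk: mean distance to the nearest center.
    return centers, d.argmin(axis=1), float(d.min(axis=1).mean())

def select_k(X, k_max, lam):
    """Return the k minimizing risk + lam * sqrt(k / n); `lam` is a hypothetical constant."""
    n = len(X)
    crit = {}
    for k in range(1, k_max + 1):
        _, _, risk = k_medians(X, k)
        crit[k] = risk + lam * np.sqrt(k / n)
    return min(crit, key=crit.get), crit

# Toy usage: three tight, well-separated groups in the plane.
rng = np.random.default_rng(0)
offsets = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((50, 2)) for c in offsets])
best_k, crit = select_k(X, k_max=5, lam=10.0)
print(best_k, crit)
```

In this sketch the penalty constant is fixed by hand. In practice it would need to be calibrated, and the value used above has no particular meaning outside this toy example.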

Paper Structure (16 sections, 8 theorems, 72 equations, 8 figures, 2 tables, 2 algorithms)

This paper contains 16 sections, 8 theorems, 72 equations, 8 figures, 2 tables, 2 algorithms.

Introduction
Framework
Geometric Median
K-medians
The choice of k
Simulations
Visualization of results with the package Kmedians
Comparison with Gap Statistic and Silhouette
Contaminated Data in Higher Dimensions
An illustration on real data
Conclusion
Proofs
Some definitions and lemma
Proof of Theorem 3.1
Proof of Theorem 3.2
...and 1 more sections

Key Result

Theorem 3.1

Let $X_1, \ldots, X_n$ be random vectors taking values in $\mathbb{R}^d$ with the same law as $X$, and assume that $\| X \| \le R$ almost surely for some $R > 0$. Define $W$ and $W_n$ as in the equations labeled W and Wn, respectively. Then for all $1 \le k \le n$,

Figures (8)

Figure 1: Evolution of $-W_n(\hat{c}_{k})$ with respect to penalty shape: $\sqrt{k/n}$.
Figure 2: Evolution of $W_n(\hat{c}_{k})$ (on the left) and $\text{crit}(k)$ (on the right) with respect to $k$.
Figure 3: Profiles (on the left) and clustering via K-medians represented on the first two principal components (on the right) without contaminated data.
Figure 4: Profiles (on the left) and clustering via K-medians algorithm represented on the first two principal components (on the right) with $5\%$ of contaminated data.
Figure 5: Profiles (on the left) and clustering via K-means algorithm represented on the first two principal components (on the right) with $5\%$ of contaminated data.
...and 3 more figures

Theorems & Definitions (12)

Theorem 3.1
Theorem 3.2
Proposition 3.1
Lemma 5.1: hoeffding1994probability
Lemma 5.2: cesa1999minimax, Proposition 3
Lemma 5.3: bartlett1998minimax, Lemma 1
Lemma 5.4
proof: Proof of Lemma 5.4
Lemma 5.5: mcdiarmid1989method, massart2007concentration : Theorem 5.3
proof
...and 2 more
