A penalized criterion for selecting the number of clusters for K-medians

Antoine Godichon-Baggioni, Sobihan Surendran

TL;DR

This article derives a suitable penalty shape for a criterion that selects the number of clusters in K-medians, and proves an associated oracle-type inequality. A simulation study compares the approach, run with different types of K-medians algorithms, against other popular techniques.

Abstract

Clustering is a common unsupervised machine learning technique for grouping data points based on similar features. We focus here on unsupervised clustering for contaminated data, i.e., the case where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to treat the choice of the optimal number of clusters as the minimization of a risk function via penalization. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared in a simulation study with other popular techniques. All studied algorithms are available in the R package Kmedians on CRAN.
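
To make the idea of selecting the number of clusters by penalization concrete, here is a minimal Python sketch. It is not the authors' implementation and does not use the R package Kmedians. It assumes that the criterion has the form $\text{crit}(k) = W_n(\hat{c}_k) + \lambda \sqrt{k/n}$, where the penalty shape $\sqrt{k/n}$ is the one named in the caption of Figure 1, and that $W_n$ is the mean Euclidean distance from each point to its nearest center. The constant `lam`, the farthest-first initialization, the Weiszfeld iterations, and all function names are illustrative assumptions.

```python
import numpy as np


def geometric_median(points, n_iter=50, eps=1e-9):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iterations)."""
    m = points.mean(axis=0)
    for _ in range(n_iter):
        # Clip distances away from zero to avoid division by zero.
        dist = np.maximum(np.linalg.norm(points - m, axis=1), eps)
        w = 1.0 / dist
        m = (w[:, None] * points).sum(axis=0) / w.sum()
    return m


def pairwise_dist(X, centers):
    """Euclidean distances between every point and every center, shape (n, k)."""
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)


def empirical_risk(X, centers):
    """Assumed form of W_n: mean distance from each point to its nearest center."""
    return pairwise_dist(X, centers).min(axis=1).mean()


def farthest_first_init(X, k, rng):
    """Illustrative initialization: random first center, then farthest points."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = pairwise_dist(X, np.array(centers)).min(axis=1)
        centers.append(X[d.argmax()])
    return np.array(centers)


def k_medians(X, k, rng, n_iter=20):
    """Lloyd-style K-medians: assign to nearest center, then update with geometric medians."""
    centers = farthest_first_init(X, k, rng)
    for _ in range(n_iter):
        labels = pairwise_dist(X, centers).argmin(axis=1)
        centers = np.array([
            geometric_median(X[labels == j]) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers


def select_k(X, k_max, lam, seed=0):
    """Return the k minimizing W_n(c_k) + lam * sqrt(k / n), plus all criterion values."""
    rng = np.random.default_rng(seed)
    n = len(X)
    crit = {}
    for k in range(1, k_max + 1):
        centers = k_medians(X, k, rng)
        crit[k] = empirical_risk(X, centers) + lam * np.sqrt(k / n)
    return min(crit, key=crit.get), crit


if __name__ == "__main__":
    # Synthetic data: three well-separated Gaussian clusters (hypothetical example).
    rng = np.random.default_rng(0)
    means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    X = np.vstack([m + 0.5 * rng.standard_normal((100, 2)) for m in means])
    best_k, crit = select_k(X, k_max=5, lam=10.0)
    print(best_k, crit)
```

The value `lam=10.0` is chosen by hand for this toy data set. In practice the constant in front of the penalty shape has to be calibrated from the data, and this sketch does not attempt that step.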

Paper Structure (16 sections, 8 theorems, 72 equations, 8 figures, 2 tables, 2 algorithms)

Key Result

Theorem 3.1

Let $X_1, \ldots, X_n$ be random vectors taking values in $\mathbb{R}^d$ with the same law as $X$, and assume that $\| X \| \le R$ almost surely for some $R > 0$. Define $W$ and $W_n$ as in the paper's equations labeled (W) and (Wn), respectively. Then for all $1 \le k \le n$,

Figures (8)

  • Figure 1: Evolution of $-W_n(\hat{c}_{k})$ with respect to penalty shape: $\sqrt{k/n}$.
  • Figure 2: Evolution of $W_n(\hat{c}_{k})$ (on the left) and $\text{crit}(k)$ (on the right) with respect to $k$.
  • Figure 3: Profiles (on the left) and clustering via K-medians represented on the first two principal components (on the right) without contaminated data.
  • Figure 4: Profiles (on the left) and clustering via K-medians algorithm represented on the first two principal components (on the right) with $5\%$ of contaminated data.
  • Figure 5: Profiles (on the left) and clustering via K-means algorithm represented on the first two principal components (on the right) with $5\%$ of contaminated data.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Theorem 3.1
  • Theorem 3.2
  • Proposition 3.1
  • Lemma 5.1 [hoeffding1994probability]
  • Lemma 5.2 [cesa1999minimax, Proposition 3]
  • Lemma 5.3 [bartlett1998minimax, Lemma 1]
  • Lemma 5.4
  • Proof of Lemma 5.4
  • Lemma 5.5 [mcdiarmid1989method; massart2007concentration, Theorem 5.3]
  • Proof
  • ...and 2 more