Careful Seeding for k-Medoids Clustering with Incremental k-Means++ Initialization

Difei Cheng, Yunfeng Zhang, Ruinan Jin

TL;DR

This work tackles the sensitivity of k-medoids to initialization by introducing INCKPP, an incremental, nonparametric k-means++-based seeding that adds one center per stage and uses a probabilistic distance-based rule to maximize separation, followed by an FKM refinement. To address computational inefficiency, the authors propose INCKPP${}_{sample}$, which performs a fast pre-search on a data subset and then refines with FKM on the full set, achieving $O(t_1 n_1 k^2 + t_2 n k)$ complexity for medoid updates. Unlike INCKM, INCKPP eliminates the need for a preset stretch factor and demonstrates improved performance on imbalanced and complex data distributions across extensive synthetic and real-world experiments. The results show consistent gains in min-SE and often fewer iterations, establishing INCKPP and its fast variant as robust, efficient alternatives for k-medoids clustering with broad applicability. The work also hints at future directions combining these clustering techniques with generative models such as GANs to further enhance unsupervised learning tasks.
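The two-phase idea in the summary (pre-search on a subset, then refinement on the full data) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `pairwise_dist`, `alternate_kmedoids`, and `sample_then_refine` are hypothetical, the refinement is a generic alternating k-medoids update standing in for FKM, Euclidean distance is assumed, and the pre-search uses plain random seeding as a placeholder for the INCKPP seeding rule.

```python
import numpy as np

def pairwise_dist(X, Y):
    # Euclidean distances between every row of X and every row of Y.
    return np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))

def alternate_kmedoids(X, medoid_idx, max_iter=100):
    """Generic alternating k-medoids (assumed stand-in for FKM).

    Assign each point to its nearest medoid, then move each medoid to the
    cluster member with the smallest total distance to its cluster.
    """
    medoid_idx = np.array(medoid_idx)
    for _ in range(max_iter):
        labels = pairwise_dist(X, X[medoid_idx]).argmin(axis=1)
        new_idx = medoid_idx.copy()
        for j in range(len(medoid_idx)):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            within = pairwise_dist(X[members], X[members]).sum(axis=1)
            new_idx[j] = members[within.argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break  # medoids did not move: converged
        medoid_idx = new_idx
    dist = pairwise_dist(X, X[medoid_idx])
    # Return medoid indices, labels, and the sum of point-to-medoid distances.
    return medoid_idx, dist.argmin(axis=1), dist.min(axis=1).sum()

def sample_then_refine(X, k, p=10, seed=0):
    """Pre-search on a random p% subset, then refine on the full data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n1 = max(k, int(round(n * p / 100)))
    subset = rng.choice(n, size=n1, replace=False)
    # Placeholder seeding (assumption): uniform random picks in the subset.
    init = rng.choice(n1, size=k, replace=False)
    sub_medoids, _, _ = alternate_kmedoids(X[subset], init)
    # Map subset-local medoid indices back to full-data indices and refine.
    return alternate_kmedoids(X, subset[sub_medoids])
```

The expensive medoid search runs only on the $n_1$ sampled points, and the full dataset is touched only in the final refinement, which is the source of the speed-up the summary describes.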

Abstract

K-medoids clustering is a popular variant of k-means clustering and is widely used in pattern recognition and machine learning. A main drawback of k-medoids clustering is that an improper initialization can cause it to get trapped in local optima. An improved k-medoids clustering algorithm, called the INCKM algorithm, which was the first to apply incremental initialization to k-medoids clustering, was recently proposed to overcome this drawback. The INCKM algorithm requires constructing a subset of candidate medoids, determined by one hyperparameter, for initialization, and it fails on imbalanced datasets when this hyperparameter is chosen incorrectly. In this paper, we propose a novel k-medoids clustering algorithm, called the incremental k-means++ (INCKPP) algorithm, which initializes in a novel incremental manner, attempting to optimally add one new cluster center at each stage through a nonparametric and stochastic k-means++ initialization. The INCKPP algorithm overcomes the difficulty of hyperparameter selection in the INCKM algorithm, improves the clustering performance, and handles imbalanced datasets well. However, the INCKPP algorithm is not computationally efficient enough. To address this, we further propose an improved variant, called the INCKPP${}_{sample}$ algorithm, which improves the clustering efficiency while maintaining the clustering performance of the INCKPP algorithm. Extensive experiments on both synthetic and real-world datasets, including imbalanced datasets, show that the proposed algorithms outperform the other compared algorithms.
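For readers unfamiliar with k-means++ seeding, the sketch below shows the standard $D^2$ sampling rule restricted to data points, so that every seed is a valid medoid. It illustrates only the generic, hyperparameter-free k-means++ step; the function name `dsquared_seeding` is hypothetical, squared Euclidean distance is assumed, and the paper's specific incremental rule for accepting each new center is not reproduced here.

```python
import numpy as np

def dsquared_seeding(X, k, seed=0):
    """Pick k data-point indices by k-means++ style D^2 sampling."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: uniform random data point.
    centers = [int(rng.integers(n))]
    # Squared distance from each point to its nearest chosen center.
    d2 = ((X - X[centers[0]]) ** 2).sum(axis=1)
    while len(centers) < k:
        total = d2.sum()
        if total == 0:
            # All remaining points coincide with a center: fall back to uniform.
            probs = np.full(n, 1.0 / n)
        else:
            # Points far from every current center are more likely to be chosen.
            probs = d2 / total
        nxt = int(rng.choice(n, p=probs))
        centers.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return np.array(centers)
```

Because an already chosen point has zero distance to itself, it has zero probability of being drawn again, and no stretch factor or candidate-subset parameter is involved in the draw.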

Paper Structure (21 sections, 4 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: The clustering result on the imbalance${}_{2}$ data set. INCKM produces an incorrect clustering because it initializes two cluster centers in one class, so one original class is split into two. Note that INCKPP obtains the correct result.
  • Figure 2: The comparisons on the synthetic datasets with different $p$, where $p$ is the randomly sampled percentage of the dataset in the pre-search procedure. The values in the figures are the minimum of the sum of errors (min-SE) obtained by each compared algorithm within the CPU time used by INCKPP${}_{sample}$ running $N$ times for different $p$.
  • Figure 3: The comparisons on the synthetic datasets for different $N$, where $N$ is the number of runs of the INCKPP${}_{sample}$ algorithm. The values in the figures are the minimum of the sum of errors (min-SE) obtained by each compared algorithm within the CPU time used by INCKPP${}_{sample}$ running $N$ times with $p=10$.
  • Figure 4: The comparisons on the real datasets with different $p$, where $p$ and the values in the figures are the same as those in Figure 2.
  • Figure 5: The comparisons on the real datasets for different $N$, where $N$ and the values in the figures are the same as those in Figure 3.