A New Index for Clustering Evaluation Based on Density Estimation

Gangli Liu

TL;DR

This work tackles the challenge of internal clustering validation by introducing a density-estimation based index that blends two sub-indices: an Ambiguous Index $I_a$ and a Similarity Index $I_s$, combined as $I = \delta I_a + (1 - \delta) I_s$ with $\delta \in [0,1]$. Each sub-index relies on per-cluster kernel density estimates, defining cluster territories and likelihood-based similarity to capture both ambiguity and cohesion within clusters. The approach is evaluated on 145 datasets against six established internal indices, showing that the new index significantly improves ranking accuracy, especially after bandwidth optimization and algorithmic refinements. The results suggest practical gains for internal clustering validation, with potential extensions to higher dimensions and alternative density estimators.

Abstract

A new index for internal evaluation of clustering is introduced. The index is defined as a mixture of two sub-indices. The first sub-index, $I_a$, is called the Ambiguous Index; the second sub-index, $I_s$, is called the Similarity Index. Both sub-indices are calculated from a density estimate of each cluster in a partition of the data. An experiment on a set of 145 datasets tests the performance of the new index and compares it with six other internal clustering evaluation indices: Calinski-Harabasz index, Silhouette coefficient, Davies-Bouldin index, CDbw, DBCV, and VIASCKDE. The results show that the new index significantly improves on the other internal clustering evaluation indices.
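The two ingredients named above, a density estimate per cluster and the mixture $I = \delta I_a + (1 - \delta) I_s$, can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the helper names (`gaussian_kde_1d`, `per_cluster_densities`, `mix_index`) are hypothetical, the data are assumed one-dimensional, a Gaussian kernel with a fixed bandwidth is assumed, and the computation of $I_a$ and $I_s$ themselves is left out because their definitions are not given in this summary.

```python
import math


def gaussian_kde_1d(sample, bandwidth):
    """Return a density function for `sample` (assumed: Gaussian kernel, 1-D data)."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each sample point.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def per_cluster_densities(points, labels, bandwidth):
    """Fit one density estimate per cluster; labels[i] is the cluster of points[i]."""
    clusters = {}
    for x, k in zip(points, labels):
        clusters.setdefault(k, []).append(x)
    return {k: gaussian_kde_1d(xs, bandwidth) for k, xs in clusters.items()}


def mix_index(i_a, i_s, delta):
    """Combine the two sub-indices: I = delta * I_a + (1 - delta) * I_s."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return delta * i_a + (1.0 - delta) * i_s
```

For example, `per_cluster_densities([0.0, 0.1, 0.2, 5.0, 5.1], [0, 0, 0, 1, 1], 0.5)` returns one density function per cluster label, and each function assigns higher density near its own cluster's points than the other cluster's function does. Setting `delta` to 1 or 0 in `mix_index` recovers $I_a$ or $I_s$ alone.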

Paper Structure

This paper contains 31 sections, 16 equations, 30 figures, 5 tables.

Figures (30)

  • Figure 1: A dataset
  • Figure 2: A partition of the dataset
  • Figure 3: Ambiguous Points of the partition
  • Figure 4: Mixture of the two sub-indices works
  • Figure 5: Result of one dataset
  • ...and 25 more figures

Theorems & Definitions (7)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5
  • Definition 6.1
  • Definition 6.2