Silhouette-Driven Instance-Weighted $k$-means

Aggelos Semoglou, Aristidis Likas, John Pavlopoulos

TL;DR

K-Sil is a silhouette-driven $k$-means variant that, at each iteration, weights points using a centroid-margin proxy for the silhouette score, emphasizing confidently assigned instances while down-weighting borderline or noisy regions.

Abstract

Clustering is a fundamental unsupervised learning task with applications across a wide range of domains. Popular algorithms such as $k$-means are efficient and widely used, but can be sensitive to outliers, ambiguous boundary points, and heterogeneous cluster geometry, which may distort centroid estimates and yield suboptimal partitions. We introduce K-Sil, a silhouette-driven $k$-means variant that, at each iteration, weights points using a centroid-margin proxy for the silhouette score, emphasizing confidently assigned instances while down-weighting borderline or noisy regions. Centroid updates take the form of a softmax-weighted mean, and an adaptive temperature automatically calibrates the sharpness of the weight distribution using a cluster-balanced, macro-averaged silhouette criterion. Under standard separation conditions, we establish a local convergence result for the induced weighted centroid updates. Experiments on 15 real-world datasets spanning tabular, biomedical, text, and image representations show consistent gains in internal validation metrics and typical improvements in external validation metrics over $k$-means and competitive instance-weighted baselines.
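The weighted centroid update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the exact score and weight formulas are not given in this excerpt. The sketch assumes the centroid-margin proxy is the simplified silhouette $s_i = (b_i - a_i)/\max(a_i, b_i)$, where $a_i$ is the distance to the assigned centroid and $b_i$ the distance to the nearest other centroid. It also assumes the weights are a per-cluster softmax of $\tau s_i$, which matches the Figure 1 caption (larger $\tau$ gives more peaked weights). The function names are hypothetical, $\tau$ is held fixed rather than adaptively calibrated, and at least two clusters are required.

```python
import numpy as np

def centroid_margin_scores(X, centroids, labels):
    """Assumed silhouette proxy: (b - a) / max(a, b), using centroid distances only."""
    # Distances from every point to every centroid, shape (n, k).
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = np.arange(len(X))
    a = d[idx, labels]                 # distance to the assigned centroid
    d_other = d.copy()
    d_other[idx, labels] = np.inf      # mask the assigned centroid
    b = d_other.min(axis=1)            # distance to the nearest other centroid
    return (b - a) / np.maximum(np.maximum(a, b), 1e-12)

def softmax_weighted_update(X, centroids, labels, tau):
    """One centroid update with per-cluster softmax weights exp(tau * s_i) (assumed form)."""
    s = centroid_margin_scores(X, centroids, labels)
    new_centroids = centroids.astype(float).copy()
    weights = np.zeros(len(X))
    for j in range(len(centroids)):
        members = labels == j
        if not members.any():
            continue                   # empty cluster: keep the previous centroid
        logits = tau * s[members]
        w = np.exp(logits - logits.max())   # subtract the max for numerical stability
        w /= w.sum()
        weights[members] = w
        new_centroids[j] = w @ X[members]   # softmax-weighted mean of the members
    return new_centroids, weights

# Toy data: two Gaussian blobs, assigned to their nearest centroid.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(4.0, 0.5, (50, 2))])
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)

plain_c, plain_w = softmax_weighted_update(X, centroids, labels, tau=0.0)
flat_c, flat_w = softmax_weighted_update(X, centroids, labels, tau=0.2)
peaked_c, peaked_w = softmax_weighted_update(X, centroids, labels, tau=2.0)
```

With `tau=0.0` every member of a cluster gets the same weight, so the update reduces to the ordinary $k$-means mean. Increasing `tau` concentrates weight on points with a large margin between their own centroid and the nearest competing one.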

Paper Structure

This paper contains 21 sections, 41 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: Effect of temperature $\tau$ on weights (Eq. \ref{eq:weights}), on synthetic data ($k=3$). Left: $\tau = 0.2$ (flatter weighting). Right: $\tau = 2$ (more peaked weighting) ($\uparrow$ red, $\downarrow$ blue).
  • Figure 2: Hungarian-matched accuracy across top $s_i(\mu)$ percentiles (from a reference $k$-means partition), comparing the original feature space and the UMAP representation on Mip (left) and Vcl (right).
  • Figure 3: Average centroid movement and Pearson correlation between weights across K-Sil iterations $t$ (§\ref{subsec:eval}) for Mip, Vcl and Stl. The final iteration (convergence) is marked by $\square$. The remaining datasets (§\ref{subsec:datasets}) are shown in Appendix \ref{app:empirical} (Figs. \ref{fig:convergence_app_a}--\ref{fig:convergence_app_d}).
  • Figure 4: Mean ARI and SIL across multiple random initializations and datasets (§\ref{subsec:datasets}) as a function of the number of clusters $k \in \{k^\star-5, k^\star-4, \dots, k^\star, \dots, k^\star+4, k^\star+5\}$.
  • Figure 5: Mean relative change (% $\uparrow$/$\downarrow$) in SIL, ARI, NMI of K-Sil over $k$-means (with $k$-means++ initialization) over 120 independent runs across datasets (§\ref{subsec:datasets}).
  • ...and 9 more figures

Theorems & Definitions (5)

  • proof
  • proof
  • proof
  • proof
  • proof