Improving clustering quality evaluation in noisy Gaussian mixtures
Renato Cordeiro de Amorim, Vladimir Makarenkov
TL;DR
This work tackles the instability of internal clustering validation in noisy, high-dimensional data by introducing Feature Importance Rescaling (FIR). FIR computes a dispersion $D_v$ for each of the $m$ features and derives the optimal feature weights $\alpha_v = 1 / \sum_{j=1}^m (D_v / D_j)$, which define the weighted within-cluster objective $WCSS_w = \sum_{v=1}^m \alpha_v^2 D_v$ and thereby down-weight noisy features. The authors prove convexity, the existence of a unique solution, scale invariance, and that $\alpha_v$ decreases as dispersion increases, while noting that FIR intentionally violates the richness axiom. Empirically, FIR consistently improves the correlation between internal indices (WCSS, ASW, CH, DB) and ground-truth clustering quality as measured by the Adjusted Rand Index (ARI), especially when noise features or cluster overlap are present, indicating more robust unsupervised evaluation. This yields a practical tool for clustering validation in settings that lack labelled data, with potential extensions to other clustering paradigms.
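The weight formula above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes that $D_v$ is the within-cluster sum of squared deviations of feature $v$ from its cluster means, that every $D_v$ is strictly positive, and the function names (`feature_dispersions`, `fir_weights`, `weighted_wcss`) and the toy data are hypothetical. Note that the formula is algebraically equal to $\alpha_v = (1/D_v) / \sum_{j=1}^m (1/D_j)$, so the weights sum to 1.

```python
from collections import defaultdict


def feature_dispersions(X, labels):
    """Per-feature within-cluster sum of squared deviations (assumed form of D_v)."""
    m = len(X[0])
    clusters = defaultdict(list)
    for row, lab in zip(X, labels):
        clusters[lab].append(row)
    D = [0.0] * m
    for rows in clusters.values():
        for v in range(m):
            mean_v = sum(r[v] for r in rows) / len(rows)
            D[v] += sum((r[v] - mean_v) ** 2 for r in rows)
    return D


def fir_weights(D):
    """alpha_v = 1 / sum_j (D_v / D_j); requires every D_j > 0."""
    if any(d <= 0 for d in D):
        raise ValueError("all dispersions must be strictly positive")
    return [1.0 / sum(d_v / d_j for d_j in D) for d_v in D]


def weighted_wcss(D, alpha):
    """WCSS_w = sum_v alpha_v^2 * D_v."""
    return sum(a * a * d for a, d in zip(alpha, D))


# Toy data (illustrative only): feature 0 separates the two clusters,
# feature 1 behaves like noise with a large within-cluster spread.
X = [
    [0.0, 5.0],
    [0.2, -3.0],
    [4.0, 4.0],
    [4.2, -6.0],
]
labels = [0, 0, 1, 1]

D = feature_dispersions(X, labels)
alpha = fir_weights(D)
print("dispersions:", D)
print("weights:", alpha)
print("weighted WCSS:", weighted_wcss(D, alpha))
```

In this toy example the noisy second feature has a far larger dispersion than the first, so it receives a weight close to zero, while multiplying all dispersions by a common constant leaves the weights unchanged.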
Abstract
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by the feature relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
