Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering

Fabian Galis, Darian Onchis

TL;DR

This work tackles the challenge of feature relevance in unsupervised clustering by introducing a SHAP-based global feature weighting (FW) strategy. It adapts SHAP values, computed via a surrogate model trained on initial pseudo-labels, to derive per-feature weights that reweight data before re-clustering; SHAP weights can be combined with traditional FW methods in ensembles, enabling flexible, data-driven weighting. Across five datasets (Iris, Wine, Breast cancer, Digits, Vehicle Silhouettes) and four clustering algorithms (k-means, Ward, HDBSCAN, GMM), SHAP-based FW frequently matches or surpasses standard FW methods and yields notable gains in certain ensembles (e.g., SHAP with $L_p$ or with mRMR). The approach offers a practical, explainable tool for unsupervised learning, though it introduces computational overhead through the surrogate-model step and relies on pseudo-label quality.

Abstract

In unsupervised learning, effective clustering plays a vital role in revealing patterns and insights from unlabeled data. However, the success of clustering algorithms often depends on the relevance and contribution of individual features, which can differ from one dataset to another. This paper explores feature weighting for clustering and presents new weighting strategies, including methods based on SHAP (SHapley Additive exPlanations), a technique commonly used to provide explainability in supervised machine learning tasks. Rather than using SHAP values only for explainability, we use them to weight features and ultimately improve the clustering process itself in unsupervised scenarios. Our empirical evaluations across five benchmark datasets and four clustering methods demonstrate that feature weighting based on SHAP can enhance unsupervised clustering quality, achieving up to a 22.69% improvement over other weighting methods (from 0.586 to 0.719 in terms of the Adjusted Rand Index). Additionally, the situations in which weighted data improves the results are highlighted and explored in detail, offering insight for practical applications.
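
The pipeline summarized in the TL;DR (initial clustering, surrogate model on pseudo-labels, SHAP-derived global weights, re-clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the surrogate is a multinomial logistic regression, SHAP values are computed with the closed-form Linear SHAP expression (coefficient times the feature's deviation from its mean, which assumes independent features and explains the log-odds output), weights are the normalized mean absolute SHAP values, and reweighting is done by multiplying each standardized feature by its weight. The helper name `shap_feature_weights` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def shap_feature_weights(X, pseudo_labels):
    """Global feature weights from Linear SHAP on a logistic surrogate.

    Assumption: features are treated as independent, so the SHAP value of
    feature j for sample i and class c is coef[c, j] * (x_ij - mean_j).
    """
    surrogate = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)
    centered = X - X.mean(axis=0)  # shape (n_samples, n_features)
    # phi has shape (n_classes, n_samples, n_features)
    phi = surrogate.coef_[:, None, :] * centered[None, :, :]
    # Global importance: mean absolute SHAP value over classes and samples
    importance = np.abs(phi).mean(axis=(0, 1))
    return importance / importance.sum()  # normalize so weights sum to 1


# True labels are loaded only to score the result, never to fit anything.
X, y_true = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Step 1: initial clustering produces pseudo-labels.
initial = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: surrogate model + SHAP values give one weight per feature.
weights = shap_feature_weights(X, initial)

# Step 3: reweight the data (assumed: elementwise scaling) and re-cluster.
refined = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X * weights)

print("feature weights:", np.round(weights, 3))
print("ARI before weighting:", adjusted_rand_score(y_true, initial))
print("ARI after weighting: ", adjusted_rand_score(y_true, refined))
```

Whether the weighted run scores higher depends on the dataset, the clustering algorithm, and the quality of the pseudo-labels, which is the dependence the TL;DR points out. A tree-based surrogate with a general-purpose SHAP explainer would fit the same three-step structure.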

Paper Structure

This paper contains 17 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Flowchart of the employed FW methodology.
  • Figure 2: HDBSCAN metrics across five FW methods on the Iris dataset.
  • Figure 3: HDBSCAN metrics across three FW methods on the Wine dataset.
  • Figure 4: HDBSCAN metrics on four FW methods for the Breast cancer dataset.