
Explaining AutoClustering: Uncovering Meta-Feature Contribution in AutoML for Clustering

Matheus Camilo da Silva, Leonardo Arrighi, Ana Carolina Lorena, Sylvio Barbon Junior

TL;DR

This work tackles the lack of transparency in AutoClustering by delivering a unified taxonomy of meta-features across 22 frameworks and a two-level explainability approach. Global explanations via Decision Predicate Graphs reveal which meta-features structurally drive meta-model decisions, while local SHAP attributions show how individual meta-features contribute to specific clustering recommendations. An explainability-driven ablation demonstrates that a small core of meta-features accounts for most of the predictive power, enabling large reductions in feature extraction cost with limited accuracy loss. Together, these findings provide practical guidelines for building more interpretable and cost-efficient AutoML systems for clustering, including bias diagnostics and improved meta-feature engineering. The work lays the groundwork for auditing and refining AutoClustering pipelines, with implications for reliability and deployment in real-world unsupervised learning tasks.

Abstract

AutoClustering methods aim to automate unsupervised learning tasks, including algorithm selection (AS), hyperparameter optimization (HPO), and pipeline synthesis (PS), often by leveraging meta-learning over dataset meta-features. While these systems frequently achieve strong performance, their recommendations are difficult to justify: the influence of dataset meta-features on algorithm and hyperparameter choices is typically not exposed, limiting reliability, bias diagnostics, and efficient meta-feature engineering. In this work, we investigate the explainability of the meta-models in AutoClustering. We first review 22 existing methods and organize their meta-features into a structured taxonomy. We then apply a global explainability technique, Decision Predicate Graphs, to assess feature importance within meta-models from selected frameworks. Finally, we use local explainability tools such as SHAP (SHapley Additive exPlanations) to analyse specific clustering decisions. Our findings highlight consistent patterns in meta-feature relevance, identify structural weaknesses in current meta-learning strategies that can distort recommendations, and provide actionable guidance for more interpretable Automated Machine Learning (AutoML) design. This study therefore offers a practical foundation for increasing decision transparency in unsupervised learning automation.
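
To make the idea of a local, per-recommendation attribution concrete, the following is a minimal sketch of exact Shapley values for a toy meta-model. It is not the paper's pipeline and does not use the SHAP library: the scoring function, the three meta-feature names, the input values, and the all-zero baseline are all hypothetical, chosen only so that the result is easy to check by hand.

```python
from itertools import combinations
from math import factorial

# Hypothetical meta-feature names, used only for illustration.
FEATURES = ["n_instances_log", "n_attributes_log", "sparsity"]


def toy_meta_model(x):
    """Toy stand-in for a meta-model score (not any framework's real model)."""
    n, d, s = x
    return 2.0 * n - 1.0 * d + 0.5 * n * s


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    m = len(x)

    def value(subset):
        # Evaluate the model with only the features in `subset` "switched on".
        z = [x[i] if i in subset else baseline[i] for i in range(m)]
        return model(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for k in range(m):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(m - k - 1) / factorial(m)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


x = [3.0, 1.0, 2.0]         # assumed meta-feature values for one dataset
baseline = [0.0, 0.0, 0.0]  # assumed reference point
phi = shapley_values(toy_meta_model, x, baseline)

# Rank meta-features by absolute contribution, largest first.
ranking = sorted(zip(FEATURES, phi), key=lambda t: -abs(t[1]))
for name, contribution in ranking:
    print(f"{name}: {contribution:+.3f}")
```

The attributions sum to the difference between the model output at `x` and at `baseline`, which is the additivity property that SHAP-style explanations rely on. Exact enumeration is exponential in the number of features, so it is only practical for a handful of meta-features; real meta-models would normally use an approximate or model-specific explainer instead.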

Paper Structure (45 sections, 6 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Heatmap showing the distribution of meta-feature families across AutoClustering frameworks. The color intensity corresponds to the number of meta-features in each family for each framework. The frameworks are sorted in ascending order of publication date.
  • Figure 2: Comparison of top and bottom meta-feature predicates per AutoClustering framework (AutoClust, AutoCluster, ML2DAC, PoAC) based on LRC. Top predicates (a) highlight features most embedded in the information flow, while bottom predicates (b) reflect less influential or potentially redundant aspects. Bars are color-coded by meta-feature category.
  • Figure 3: SHAP summary plot for ML2DAC showing the average contribution of each meta-feature to the model’s CVI recommendation. Colors denote the predicted CVI (CH, DBCV, SIL, COP).
  • Figure 4: Cohort-level SHAP summary plot analysing PoAC's recommended clustering pipelines across 25 datasets. Each point represents a dataset/pipeline meta-feature value (color: red=high, blue=low) and its impact on the prediction (x-axis: SHAP value). Meta-features are ranked by mean absolute SHAP value.
  • Figure 5: SHAP explanations for two PoAC-predicted pipelines on the thy dataset, representing low and high clustering quality.
  • ...and 1 more figures