CLAMS: A System for Zero-Shot Model Selection for Clustering

Prabhant Singh, Pieter Gijsbers, Murat Onur Yildirim, Elif Ceren Gok, Joaquin Vanschoren

TL;DR

The paper tackles zero-shot model selection for clustering. It introduces CLAMS, an AutoML framework for full clustering pipelines, and CLAMS-OT, a meta-learning module that uses entropic Gromov-Wasserstein distances to quantify dataset similarity and transfer the best prior pipeline to a new unlabeled dataset. Because selection relies only on distances between datasets, no labels are needed for the new dataset. The authors report that CLAMS-OT outperforms the baselines on 57 OpenML clustering datasets, evaluated with adjusted mutual information (AMI) and a Bayesian analysis using a region of practical equivalence (ROPE). Key contributions include an open-source clustering AutoML tool, a scalable GW-LR-based similarity metric, and empirical evidence that similarity-aware transfer improves clustering outcomes in unlabeled settings. The work advances AutoML for unsupervised tasks by connecting dataset geometry, meta-learning, and automated pipeline search in a single framework.

Abstract

We propose an AutoML system that enables model selection on clustering problems by leveraging optimal transport-based dataset similarity. Our objective is to establish a comprehensive AutoML pipeline for clustering problems and provide recommendations for selecting the most suitable algorithms, thus opening up a new area of AutoML beyond the traditional supervised learning settings. We compare our system against multiple clustering baselines and find that it outperforms all of them, demonstrating the utility of similarity-based automated model selection for clustering applications.
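
The transfer step described above can be illustrated with a minimal sketch. It assumes the dataset distances have already been computed (for example with an entropic Gromov-Wasserstein solver) and that the best pipeline for each prior dataset is known. The function name `select_pipeline`, the dataset names, the distance values, and the pipeline descriptions are all hypothetical and chosen only for illustration; they are not taken from the paper or from the CLAMS code.

```python
from typing import Dict


def select_pipeline(distances: Dict[str, float],
                    best_pipeline: Dict[str, str]) -> str:
    """Return the best-known pipeline of the prior dataset closest to the new one.

    distances:     prior dataset name -> distance to the new, unlabeled dataset
                   (lower means more similar).
    best_pipeline: prior dataset name -> best pipeline found on that dataset.
    """
    if not distances:
        raise ValueError("at least one prior dataset is required")
    # Zero-shot selection: no labels of the new dataset are used,
    # only its distance to each prior dataset.
    nearest = min(distances, key=distances.get)
    return best_pipeline[nearest]


# Hypothetical values, for illustration only.
distances_to_new = {"prior_a": 0.42, "prior_b": 0.07, "prior_c": 0.31}
best_pipeline = {
    "prior_a": "StandardScaler -> KMeans",
    "prior_b": "PCA -> DBSCAN",
    "prior_c": "MinMaxScaler -> AgglomerativeClustering",
}

chosen = select_pipeline(distances_to_new, best_pipeline)
print(chosen)
```

Here the new dataset is closest to `prior_b`, so that dataset's pipeline is the one transferred. A real system would also have to decide how to break ties and what to do when even the nearest prior dataset is far away; this sketch does not address either case.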

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: CLAMS Overview
  • Figure 2: Overview of CLAMS-OT
  • Figure 3: Bayesian Wilcoxon signed-rank test results for CLAMS-OT vs. baselines with ROPE = 0.01. The figure shows the simplex and projections of the posterior for the Bayesian signed-rank test. The closer the distribution is to the bottom-left corner, the more likely it is that our method is better.
  • Figure 4: Critical difference diagram of CLAMS-OT vs. baselines (mean adjusted mutual information score)
  • Figure 5: Heatmap of dataset similarity; lower values indicate greater similarity