Problem-oriented AutoML in Clustering

Matheus Camilo da Silva; Gabriel Marques Tavares; Eric Medvet; Sylvio Barbon Junior

TL;DR

Experimental results show that PoAC outperforms state-of-the-art frameworks on a variety of datasets, excels in specific tasks such as data visualization, and dynamically adjusts pipeline configurations based on dataset complexity.

Abstract

The Problem-oriented AutoML in Clustering (PoAC) framework introduces a novel, flexible approach to automating clustering tasks by addressing the shortcomings of traditional AutoML solutions. Conventional methods often rely on predefined internal Clustering Validity Indexes (CVIs) and static meta-features, limiting their adaptability and effectiveness across diverse clustering tasks. In contrast, PoAC establishes a dynamic connection between the clustering problem, CVIs, and meta-features, allowing users to customize these components based on the specific context and goals of their task. At its core, PoAC employs a surrogate model trained on a large meta-knowledge base of previous clustering datasets and solutions, enabling it to infer the quality of new clustering pipelines and synthesize optimal solutions for unseen datasets. Unlike many AutoML frameworks that are constrained by fixed evaluation metrics and algorithm sets, PoAC is algorithm-agnostic, adapting seamlessly to different clustering problems without requiring additional data or retraining. Experimental results show that PoAC outperforms state-of-the-art frameworks on a variety of datasets, excels in specific tasks such as data visualization, and dynamically adjusts pipeline configurations based on dataset complexity.
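The surrogate idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the three meta-features, the choice of silhouette, Davies-Bouldin and Calinski-Harabasz as internal CVIs, ARI as the external target, the random-forest surrogate, the k-means-only search space, and every function name below are assumptions made for illustration. The exhaustive scan over candidates stands in for whatever optimizer the framework actually uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

K_CANDIDATES = range(2, 7)  # hypothetical search space: k-means with k = 2..6


def meta_features(X):
    """Toy dataset meta-features (assumed, not the paper's set)."""
    n, d = X.shape
    return [np.log(n), d, float(X.std(axis=0).mean())]


def internal_cvis(X, labels):
    """Internal CVIs, computable without ground-truth labels."""
    return [
        silhouette_score(X, labels),
        davies_bouldin_score(X, labels),
        calinski_harabasz_score(X, labels),
    ]


def describe(X, k, seed=0):
    """Run one candidate pipeline and build its surrogate input row."""
    labels = KMeans(n_clusters=k, n_init=3, random_state=seed).fit_predict(X)
    return meta_features(X) + internal_cvis(X, labels), labels


def build_meta_knowledge(n_datasets=15, seed=0):
    """Labelled synthetic datasets -> (feature row, external CVI) pairs."""
    rng = np.random.default_rng(seed)
    rows, targets = [], []
    for i in range(n_datasets):
        X, y = make_blobs(
            n_samples=150,
            centers=int(rng.integers(2, 7)),
            n_features=int(rng.integers(2, 6)),
            cluster_std=float(rng.uniform(0.5, 2.0)),
            random_state=i,
        )
        for k in K_CANDIDATES:
            row, labels = describe(X, k)
            rows.append(row)
            targets.append(adjusted_rand_score(y, labels))  # external CVI
    return np.array(rows), np.array(targets)


def recommend(surrogate, X):
    """Pick the candidate whose predicted external CVI is highest."""
    rows = np.array([describe(X, k)[0] for k in K_CANDIDATES])
    predicted = surrogate.predict(rows)
    best = int(np.argmax(predicted))
    return list(K_CANDIDATES)[best], float(predicted[best])


# Train the surrogate once on the meta-knowledge base.
features, targets = build_meta_knowledge()
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(features, targets)

# Recommend a pipeline for an unseen, unlabelled dataset.
X_new, _ = make_blobs(n_samples=150, centers=3, random_state=123)
best_k, predicted_ari = recommend(surrogate, X_new)
print(best_k, round(predicted_ari, 3))
```

The point of the design is that ground-truth labels are needed only while building the meta-knowledge base. At recommendation time the surrogate sees only label-free quantities (meta-features and internal CVIs), so it can rank candidate pipelines on a new dataset without retraining.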

Paper Structure (21 sections, 14 equations, 12 figures, 7 tables)

Introduction
Theoretical Background
Related Works
Problem-oriented AutoML in Clustering (PoAC)
Problem statement: surrogate-based PS for clustering
PoAC for the visualization problem
Problem Space Design
Feature Space Mapping
Dataset Meta-features
Internal CVI
Surrogate Modeling
External CVI
Surrogate Model
Function Optimization
Baselines
...and 6 more sections

Figures (12)

Figure 1: The Problem-oriented AutoML in Clustering (PoAC) framework, composed of problem space design, feature space mapping, surrogate modeling, function optimization, and pipeline recommendation.
Figure 2: Correlation between ARI and surrogate model.
Figure 3: Surrogate model's Feature Importance.
Figure 4: Pipeline optimization process for clustering problems within the PoAC framework. The process begins by extracting meta-features ($\mu$) and CVIs from clustering datasets. The CVI-related goal ($m$) serves as the target for the surrogate model $f_m$, which predicts the quality of pipeline candidates ($P, A, \Lambda$) on new, unseen data ($D$). The surrogate model then guides the pipeline optimization: candidate solutions are evaluated by their predicted $m$, computed from the meta-features extracted from the new data and the CVIs of the candidate pipelines.
Figure 5: Frameworks sorted in ascending order of DBS variance.
...and 7 more figures
