Mixed Data Clustering Survey and Challenges

Guillaume Guerard, Sonia Djebali

TL;DR

The paper surveys mixed-data clustering under big data conditions, arguing for hierarchical and explainable approaches and introducing a pretopology-based clustering method. It benchmarks pretopological and classical methods on real and synthetic mixed data, highlighting strong, consistent performance for Pretopo PaCMAP and PretopoMD. The study reviews dimensionality reduction choices, clustering strategies, and internal validity metrics in the mixed-data setting, and discusses scalability, feature selection, and visualization challenges. Overall, pretopology emerges as a promising framework for interpretable, scalable clustering of heterogeneous data types in complex domains.

Abstract

The advent of the big data paradigm has transformed how industries manage and analyze information, ushering in an era of unprecedented data volume, velocity, and variety. Within this landscape, mixed-data clustering has become a critical challenge, requiring innovative methods that can effectively exploit heterogeneous data types, including numerical and categorical variables. Traditional clustering techniques, typically designed for homogeneous datasets, often struggle to capture the additional complexity introduced by mixed data, underscoring the need for approaches specifically tailored to this setting. Hierarchical and explainable algorithms are particularly valuable in this context, as they provide structured, interpretable clustering results that support informed decision-making. This paper introduces a clustering method grounded in pretopological spaces. In addition, benchmarking against classical numerical clustering algorithms and existing pretopological approaches yields insights into the performance and effectiveness of the proposed method within the big data paradigm.

Paper Structure

This paper contains 58 sections, 12 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Dimensionality reduction on the Palmer Penguins dataset.
  • Figure 2: Hopkins statistic and iVAT for each dimensionality reduction method on the Palmer Penguins dataset.
  • Figure 3: Time and memory usage of the different algorithms on a base case with 500 individuals, 5 numerical and 5 categorical features.
  • Figure 4: Maximum memory usage depending on the number of individuals.
  • Figure 5: Computation time depending on the number of individuals.
  • ...and 8 more figures