Categorical data clustering: 25 years beyond K-modes

Tai Dinh; Wong Hauchi; Philippe Fournier-Viger; Daniil Lisik; Minh-Quyet Ha; Hieu-Chi Dam; Van-Nam Huynh

TL;DR

This survey traces 25 years of categorical data clustering, from the introduction of K-modes to contemporary hybrid, subspace, graph-based, and language-model-informed methods. It organizes algorithms into a taxonomy, connects them to data sources and validation metrics, and surveys cross-domain applications. By comparing open-source implementations on standard datasets, it assesses practical performance and reproducibility while outlining persistent challenges such as distance definitions, high dimensionality, and scalability. The paper highlights trends toward hybrid models, graph mining, parallel processing, and LLM-assisted labeling as key directions for interpretable clustering of categorical data. Overall, it offers a roadmap for researchers and practitioners working with categorical data.

Abstract

The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.

Paper Structure (47 sections, 19 equations, 23 figures, 16 tables, 3 algorithms)

Introduction
Data sources of the review
Categorical data clustering
Background
Similarity and Dissimilarity measures
Hierarchical versus Partitional Clustering
Ensemble Clustering
Model Based Clustering
Subspace Clustering
Graph Based Clustering
Genetic Based Clustering
Data Stream Clustering
Historical developments
From 1997 to 2007
From 2008 to 2015
...and 32 more sections

Figures (23)

Figure 1: (a) Publications by major publishers over the past 25 years. (b) Papers on categorical data clustering algorithms included in the present survey
Figure 2: Total citations received by selected papers from their respective publication years up to April 2024
Figure 3: Hierarchical Clustering
Figure 4: Hard and Fuzzy Partitional Clustering
Figure 5: Ensemble Clustering
...and 18 more figures
