Clusterability test for categorical data
Lianyu Hu, Junjie Dong, Mudi Jiang, Yan Liu, Zengyou He
TL;DR
TestCat assesses the clusterability of categorical data by reframing the question as a statistical test on attribute associations. It computes a global test statistic by summing the chi-squared statistics of all attribute pairs and derives an analytical $p$-value under an independence assumption, yielding a statistically interpretable clusterability score. The method is validated on 18 UCI categorical data sets against randomized counterparts (CRDS); it discriminates strongly between clusterable and unclusterable data and outperforms adaptations of numeric-data methods. The approach is robust and efficient, but it relies on independence assumptions and chi-squared approximations. Ongoing work aims to handle more complex attribute interactions and alternative association tests.
Abstract
The objective of clusterability evaluation is to check whether a clustering structure exists within a data set. It is a crucial yet often-overlooked issue in cluster analysis, and such a test should be conducted before applying any clustering algorithm: if a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numeric data, leaving clusterability evaluation for categorical data an open problem. Here we present TestCat, a testing-based approach that assesses the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly associated attribute pairs; hence, the sum of the chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms solutions adapted from existing clusterability evaluation methods for numeric data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
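
The test statistic described above can be illustrated with a minimal sketch. It is not the authors' implementation, and all function names (`pair_chi2`, `summed_chi2`, `chi2_upper_tail`) are hypothetical. It assumes that, under the null hypothesis of mutually independent attributes, the summed statistic is compared against a chi-squared distribution whose degrees of freedom are the sum of the per-pair degrees of freedom; the exact derivation in the paper may differ. To stay within the standard library, the chi-squared tail is approximated with the Wilson-Hilferty normal approximation rather than computed exactly.

```python
import math
from collections import Counter
from itertools import combinations


def pair_chi2(x, y):
    """Pearson chi-squared statistic and degrees of freedom for two categorical columns."""
    n = len(x)
    row = Counter(x)            # marginal counts of the first attribute
    col = Counter(y)            # marginal counts of the second attribute
    joint = Counter(zip(x, y))  # observed contingency-table counts
    stat = 0.0
    for a, row_count in row.items():
        for b, col_count in col.items():
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


def summed_chi2(columns):
    """Sum the per-pair statistics and degrees of freedom over all attribute pairs."""
    total, total_dof = 0.0, 0
    for x, y in combinations(columns, 2):
        stat, dof = pair_chi2(x, y)
        total += stat
        total_dof += dof
    return total, total_dof


def chi2_upper_tail(stat, dof):
    """Approximate P(X >= stat) for X ~ chi-squared(dof) via Wilson-Hilferty.

    Assumption of this sketch: a normal approximation is acceptable here;
    an exact tail probability would need a special-function library.
    """
    if dof <= 0:
        return 1.0
    scale = 2.0 / (9.0 * dof)
    z = ((stat / dof) ** (1.0 / 3.0) - (1.0 - scale)) / math.sqrt(scale)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Toy data: the first two attributes are perfectly associated,
    # the third is unrelated to both.
    a1 = ["a", "b"] * 10
    a2 = ["u", "v"] * 10
    a3 = ["p", "p", "q", "q"] * 5
    stat, dof = summed_chi2([a1, a2, a3])
    print(stat, dof, chi2_upper_tail(stat, dof))
```

A small approximate $p$-value suggests that the attribute pairs are more strongly associated than independence would allow, which the abstract interprets as evidence of clusterability. Where SciPy is available, `scipy.stats.chi2.sf(stat, dof)` would give the exact tail probability in place of the approximation.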
