Clusterability test for categorical data

Lianyu Hu, Junjie Dong, Mudi Jiang, Yan Liu, Zengyou He

TL;DR

TestCat addresses the problem of assessing clusterability for categorical data by reframing it as a testing problem on attribute associations. It computes a global test statistic by summing the per-pair chi-squared statistics across all attribute pairs and derives an analytical $p$-value under an independence assumption, yielding a statistically interpretable clusterability score. The method is validated on 18 UCI categorical data sets against randomized counterparts (CRDS), demonstrating strong discrimination between clusterable and unclusterable data and outperforming adaptations of numeric-data methods. While the approach is robust and efficient, it relies on independence assumptions and chi-squared approximations, with ongoing work to handle more complex attribute interactions and alternative association tests.

Abstract

The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving the clusterability evaluation issue for categorical data as an open problem. Here we present TestCat, a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly associated attribute pairs and hence the sum of chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms those solutions based on existing clusterability evaluation methods for numeric data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.

Paper Structure

This paper contains 22 sections, 3 theorems, 19 equations, 8 figures, and 2 tables.

Key Result

Lemma 1

Given a $2\times 2$ contingency table with fixed marginal frequencies, if $C_{11}$ exceeds $\lambda=\frac{2C_{1\cdot}+C_{\cdot 1}-C_{\cdot 2}}{4}$, then $Sep_{norm}(DS)$ is directly proportional to $C_{11}$: $C_{11}>\lambda \Rightarrow Sep_{norm}(DS)\propto C_{11}$.

Figures (8)

  • Figure 1: Parallel coordinate plots are utilized to display the strong association among neighboring attribute pairs in both the (a) Hayes-Roth data set and its (b) referenced random data. The attribute values are represented by "chess", "sports", "stamps" or "1", "2", "3", "4". The strength of association is determined by standardized residuals (refer to Supplementary Method 1). A standardized residual value exceeding $2$ signifies a strong positive association, while a value less than $-2$ indicates a strong negative association. (c) The ideal clustering extracted from Hayes-Roth data set contains attributes with strong positive associations within each cluster, while those with strong negative associations are distributed across separate clusters. Each of these clusters is represented by specific categories.
  • Figure 2: The TestCat method conducts clusterability analysis on a given categorical data set by providing a $p$-value. (a) A toy categorical data set of four attributes ("A", "B", "C", "D") yields six different attribute pairs that need to be tested. (b) For each attribute pair, such as "A-B", both its observed ($O$) and expected ($E$) $3\times 2$ contingency tables are constructed and used to calculate a chi-squared test statistic. The darker cells or lines indicate a higher frequency of co-occurrence between the two attribute values in the data set. (c) The chi-squared test statistics for all six attribute pairs, along with their respective degrees of freedom (df), are collected and employed to derive a final $p$-value. (d) Under an imposed assumption (refer to Methods), a single $p$-value can be calculated from the sum of test statistics and its corresponding null distribution. A boxplot of the $p$-values obtained from referenced random data sets is also displayed.
  • Figure 3: Identification results of TestCat and existing methods without dimensionality reduction on 18 UCI data sets (including original data sets and corresponding randomized data sets). (a) Bar plots of the $p$-values produced by TestCat. To better visualize smaller $p$-values that approach or equal 0, and to distinguish the significance level of 0.01 through the y-axis ($y=0$), we use transformed $p$-values, defined by the formula: $y=\log (p\text{-value}/ 0.01 + 0.000001)-\log (0.000001)$. (b) Dip and Silverman have been applied to distance values obtained from 20 different categorical distance measures (detailed descriptions of all these measures can be found in reference Sulc2022), hence constituting two sets of comparison methods: Dip-dist and Silv-dist. The outcomes of these comparisons are illustrated in a heatmap. (c) We evaluate TestCat against Dip-dist and Silv-dist by counting the number of correctly identified ODSs and CRDSs at the significance level of 0.01. The boxplots describe the count distributions of the variants of Dip-dist and Silv-dist derived from the 20 distance measures employed. Outliers are labeled with the specific distance measure used.
  • Figure 4: Count of correctly identified data sets by using TestCat and compared methods. (a) Compared methods via dimensionality reduction (running 101 times for each data set) based on Hamming distance. The resulting median $p$-value from 101 runs is used for determining whether each target data set is clusterable. The experimental results based on Lin1 distance are displayed in Supplementary Figure 2. (b) Categorical-to-Numerical methods via CDC_DR embedding. We implemented the CDC_DR embedding on each categorical data set to generate its numerical representation. Following this transformation, we utilized PCA or SPCA to further condense this numerical data. The experimental results based on CDE embedding are displayed in Supplementary Figure 3.
  • Figure 5: Illustration of clustering structure underlying 8 UCI categorical data sets by using visual assessment. All plots are derived from the Hamming distances between objects in the data sets. Here, for each original data set, we generate its corresponding randomized data set, which undoubtedly should have no clustering structure (detailed procedures for generating the random data are presented in the Methods section). (a) Scatter plots of tSNE and MDS, where different colors/shapes represent the class labels provided in the original data sets. Note that duplicate objects have been removed before running tSNE and MDS. Both (b) iVAT plots and (c) Dissimilarity plots display the reordered distances obtained through the R package "seriation". Potential clusters can be identified as multiple densely shaded blocks along the diagonal, where each square is large enough to accommodate a sufficient number of objects. The results from applying iVAT plots to the other 10 UCI categorical data sets are displayed in Supplementary Figure 5.
  • ...and 3 more figures
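
The $p$-value transformation quoted in the Figure 3 caption can be evaluated directly. The sketch below is only an illustration: the caption does not state the base of the logarithm, so the natural logarithm is assumed here, and the function name `transform_p` and its keyword parameters are hypothetical.

```python
import math


def transform_p(p_value, alpha=0.01, eps=0.000001):
    """y = log(p / alpha + eps) - log(eps), as written in the Figure 3 caption.

    Assumption: natural logarithm (the caption does not specify the base).
    """
    return math.log(p_value / alpha + eps) - math.log(eps)


# Smaller p-values give smaller transformed values; p = 0 stays finite
# because eps keeps the argument of the logarithm positive.
for p in (0.0, 0.0001, 0.01, 0.5):
    print(p, transform_p(p))
```

As written, the formula is monotonically increasing in the $p$-value and maps a $p$-value of exactly 0 to $y=0$.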

Theorems & Definitions (7)

  • Definition 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof