Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering

Mingjie Zhao, Zhanpei Huang, Yang Lu, Mengke Li, Yiqun Zhang, Weifeng Su, Yiu-ming Cheung

TL;DR

This work tackles the poorly defined category relationships in categorical data clustering by introducing DISC, a framework that jointly learns cluster assignments, cluster centers, and cluster-specific category relationships. By modeling value-level relations as graphs and extracting relation trees via minimum spanning trees, DISC derives a clustering-customized subspace distance that is Euclidean-compatible, enabling seamless extension to mixed numerical-categorical data. Extensive experiments on 12 real datasets show DISC achieving superior clustering accuracy and compactness, with strong convergence guarantees and favorable computational efficiency. The approach offers a principled, task-driven distance metric for categorical clustering that adapts to subspace-specific cluster structures and scales to large, high-dimensional data.

Abstract

Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike numerical attributes, whose values can be compared with the Euclidean distance, categorical attributes lack well-defined relationships among their possible values (also called categories), which hampers the discovery of compact categorical data clusters. Although most existing efforts focus on developing appropriate distance metrics, they typically assume a fixed topological relationship between categories when learning those metrics, which limits their adaptability to varying cluster structures and often leads to suboptimal clustering performance. This paper therefore breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics that can flexibly and accurately reveal various cluster distributions. As a result, the fitting ability of the clustering algorithm is significantly enhanced by the learnable category relationships. Moreover, the learned category relationships are proved to be compatible with the Euclidean distance metric, enabling a seamless extension to mixed datasets that include both numerical and categorical attributes. Comparative experiments on 12 real benchmark datasets with significance tests show the superior clustering accuracy of the proposed method, which attains an average rank of 1.25, significantly better than the 5.21 average rank of the current best-performing method.

Paper Structure

This paper contains 21 sections, 12 theorems, 14 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

The relation tree $\mathcal{T}_{j,r}$ defines a deterministic and Euclidean-compatible distance metric over the categorical attribute values.
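
To make the idea of a relation tree concrete, the following is a minimal sketch, not the paper's actual procedure. It assumes a single hypothetical attribute with four categories and made-up pairwise dissimilarities, extracts a minimum spanning tree with Kruskal's algorithm, and uses path lengths along the tree as the distance between categories. The names `minimum_spanning_tree`, `tree_distances`, `categories`, and `dissimilarity` are illustrative assumptions; how DISC derives edge weights per cluster and attribute is not specified in this summary.

```python
from itertools import combinations


def minimum_spanning_tree(nodes, weight):
    """Kruskal's algorithm; `weight` maps frozenset({a, b}) to a non-negative float."""
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Sort by (weight, endpoints) so that ties are broken deterministically.
    edges = sorted(
        combinations(sorted(nodes), 2),
        key=lambda e: (weight[frozenset(e)], e),
    )
    tree = []
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree.append((a, b, weight[frozenset((a, b))]))
    return tree


def tree_distances(nodes, tree):
    """All-pairs path lengths along the edges of the tree."""
    adjacency = {v: [] for v in nodes}
    for a, b, w in tree:
        adjacency[a].append((b, w))
        adjacency[b].append((a, w))
    dist = {}
    for src in nodes:
        dist[src] = {src: 0.0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adjacency[u]:
                if v not in dist[src]:
                    dist[src][v] = dist[src][u] + w
                    stack.append(v)
    return dist


# Hypothetical attribute values and made-up dissimilarities (assumptions).
categories = ["low", "medium", "high", "extreme"]
dissimilarity = {
    frozenset(("low", "medium")): 1.0,
    frozenset(("medium", "high")): 1.0,
    frozenset(("high", "extreme")): 2.0,
    frozenset(("low", "high")): 3.0,
    frozenset(("medium", "extreme")): 4.0,
    frozenset(("low", "extreme")): 5.0,
}

relation_tree = minimum_spanning_tree(categories, dissimilarity)
distances = tree_distances(categories, relation_tree)
```

In this toy setting the tree keeps the chain low, medium, high, extreme, so every pair of categories has exactly one connecting path. Path lengths on a tree with non-negative edge weights are symmetric and satisfy the triangle inequality, which is the kind of well-defined distance structure the theorem refers to; the Euclidean-compatibility argument itself is given in the paper's proof and is not reproduced by this sketch.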

Figures (4)

  • Figure 1: Clustering accuracy of our method and three distance measurement strategies using $k$-modes (Customized, Uniform, and Weighted) on four datasets (CA, NU, AP, and TA) from Table \ref{tb:statistics}. Hamming is also included as a baseline. $k$ indicates the number of clusters. Our method outperforms the other strategies on these datasets. The "Customized" strategy achieves better results than "Uniform" and "Weighted" because it assigns customized category relationships to different clusters.
  • Figure 2: Convergence curves of DISC on different datasets.
  • Figure 3: Execution time (y-axis) on synthetic datasets with different numbers of samples $n$ and attributes $l$ (x-axis).
  • Figure 4: t-SNE visualization of the DT dataset.

Theorems & Definitions (26)

  • Theorem 1
  • proof
  • Remark 1: Determinism and Euclidean Compatibility of the Inferred Relation Tree
  • Remark 2: Generalized Attribute Weighting through Relation Tree
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Lemma 1
  • proof
  • ...and 16 more