CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering
Taixi Chen, Yiu-ming Cheung, Yiqun Zhang
TL;DR
This paper addresses distance measurement for categorical data clustering when the distances between attribute values differ across clusters because the value distributions differ. It introduces CADM, a cluster-customized adaptive distance metric that uses cluster-specific value importance (CVI) and rival factors to define cluster-distance components (CVD), and incorporates cluster-aware attribute weighting (CAI). The metric covers both nominal and ordinal data and extends to mixed data. The approach is formalized with a clustering objective $J$ and per-cluster distance definitions. Evaluated on fourteen datasets against nine baselines, it achieves an average rank of $1.3$, and the ablation study identifies CVD as the main contributor. The authors report that the method is efficient, parameter-free, and highly interpretable, which makes it a practical candidate for real-world categorical and mixed-data clustering.
Abstract
An appropriate distance metric is crucial for categorical data clustering, because distances between categorical values cannot be calculated directly. However, the distances between attribute values usually vary from cluster to cluster, since the values are distributed differently in each cluster. Existing metrics do not take this into account, which leads to unreasonable distance measurements. We therefore propose a cluster-customized distance metric for categorical data clustering that competitively updates distances based on the distribution of attribute values in each cluster. In addition, we extend the proposed distance metric to mixed data that contains both numerical and categorical attributes. Experiments demonstrate the efficacy of the proposed method: it achieves an average rank close to first across fourteen datasets. The source code is available at https://anonymous.4open.science/r/CADM-47D8
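
To make the motivation concrete, the following minimal Python sketch shows how the same pair of attribute values can end up at different distances in different clusters once the distance depends on per-cluster value frequencies. It is an illustration only and not the CADM definition: the function names, the toy data, and the frequency-gap distance `toy_value_distance` are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def value_frequencies_by_cluster(values, labels):
    """Return {cluster: {value: relative frequency}} for one categorical attribute."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[label][value] += 1
    return {
        label: {v: c / sum(counter.values()) for v, c in counter.items()}
        for label, counter in counts.items()
    }


def toy_value_distance(freqs, cluster, a, b):
    """Toy within-cluster distance between two values.

    Assumption for illustration only (NOT the CADM metric): the distance is
    the absolute gap between the two values' relative frequencies in the
    given cluster, and 0 for identical values.
    """
    if a == b:
        return 0.0
    p = freqs[cluster]
    return abs(p.get(a, 0.0) - p.get(b, 0.0))


# Hypothetical data: one attribute and a fixed assignment to two clusters.
values = ["red"] * 3 + ["blue"] + ["red"] * 3 + ["blue"] * 5
labels = [0] * 4 + [1] * 8

freqs = value_frequencies_by_cluster(values, labels)
d0 = toy_value_distance(freqs, 0, "red", "blue")  # 0.5 in cluster 0
d1 = toy_value_distance(freqs, 1, "red", "blue")  # 0.25 in cluster 1
print(d0, d1)
```

A single global distance between "red" and "blue" would hide this difference, which is the gap a cluster-customized metric is meant to close.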
