CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering

Taixi Chen, Yiu-ming Cheung, Yiqun Zhang

TL;DR

This paper tackles the challenge of measuring distances for clustering categorical data when attribute-value distances vary across clusters due to differing distributions. It introduces CADM, a cluster-customized adaptive distance metric that uses cluster-specific value importance (CVI) and rival factors to define cluster-distance components (CVD) and incorporates cluster-aware attribute weighting (CAI) for both nominal and ordinal data, extendable to mixed data. The approach is formalized with a clustering objective $J$ and per-cluster distance definitions, and is evaluated on fourteen datasets where it achieves an average rank of $1.3$ against nine baselines, with ablation showing CVD as the main contributor. The method is reported to be efficient, parameter-free, and highly interpretable, suggesting strong practical impact for real-world categorical and mixed-data clustering.

Abstract

An appropriate distance metric is crucial for categorical data clustering, as the distance between categorical data cannot be calculated directly. However, the distances between attribute values usually vary across clusters, because the values are distributed differently in each cluster. This variation has not been taken into account, which leads to unreasonable distance measurement. Therefore, we propose a cluster-customized distance metric for categorical data clustering, which competitively updates distances based on the distribution of attributes in each cluster. In addition, we extend the proposed distance metric to mixed data that contains both numerical and categorical attributes. Experiments demonstrate the efficacy of the proposed method, which achieves an average rank close to first across fourteen datasets. The source code is available at https://anonymous.4open.science/r/CADM-47D8
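
The motivating observation, that the same pair of attribute values can be closer in one cluster than in another, can be illustrated with a small sketch. This is not the CADM metric: the paper's CVI, CVD, and CAI definitions are not reproduced in this summary, so the sketch uses a deliberately simple stand-in (the gap between two values' relative frequencies within a cluster). The function names `value_frequencies` and `frequency_gap`, the `color` attribute, and the toy records are all hypothetical.

```python
from collections import Counter


def value_frequencies(records, labels, attribute):
    """Return {cluster: {value: relative frequency}} for one categorical attribute."""
    counts_per_cluster = {}
    for record, label in zip(records, labels):
        counts_per_cluster.setdefault(label, Counter())[record[attribute]] += 1
    return {
        label: {value: n / sum(counts.values()) for value, n in counts.items()}
        for label, counts in counts_per_cluster.items()
    }


def frequency_gap(freqs, cluster, value_a, value_b):
    """Toy stand-in for a per-cluster value distance (an assumption, not CADM):
    the absolute difference between the two values' relative frequencies."""
    dist = freqs[cluster]
    return abs(dist.get(value_a, 0.0) - dist.get(value_b, 0.0))


# Hypothetical toy data: one attribute, two clusters of four records each.
records = [
    {"color": "red"}, {"color": "red"}, {"color": "red"}, {"color": "blue"},
    {"color": "red"}, {"color": "red"}, {"color": "blue"}, {"color": "blue"},
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

freqs = value_frequencies(records, labels, "color")
for cluster in sorted(freqs):
    print(cluster, frequency_gap(freqs, cluster, "red", "blue"))
```

In this toy data, "red" and "blue" are equally frequent in cluster 1 but not in cluster 0, so any distance derived from the value distribution differs between the two clusters. A single global value distance cannot capture that difference, which is the gap a cluster-customized metric is meant to address.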

Paper Structure

This paper contains 6 sections, 8 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Framework of attribute value distance measurement
  • Figure 2: (a) Efficiency test on three large datasets. (b) Wilcoxon signed-rank test on fourteen datasets. (c) and (d) Ablation study on the categorical and mixed datasets.

Theorems & Definitions (2)

  • Remark 1
  • Remark 2