Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning

Jesse S. Ghashti, John R. J. Thompson

TL;DR

The DKPS function is proved to be a metric and shown to be a shrinkage method between maximum dissimilarity between all data points and uniform dissimilarity across data points. It also improves clustering accuracy on simulated and real-world mixed-type datasets.

Abstract

Distance-based clustering and classification are widely used in various fields to group mixed numeric and categorical data. In many algorithms, a predefined distance measurement is used to cluster data points based on their dissimilarity. While there exist numerous distance-based measures for data with pure numerical attributes and several ordered and unordered categorical metrics, an efficient and accurate distance for mixed-type data that utilizes the continuous and discrete properties simultaneously is an open problem. Many metrics convert numerical attributes to categorical ones or vice versa. They handle the data points as a single attribute type or calculate a distance between each attribute separately and add them up. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with cross-validated optimal bandwidth selection. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and improves clustering accuracy when utilized in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data.
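
To make the idea of a kernel-based mixed-type dissimilarity concrete, here is a minimal sketch. It is not the paper's KDSUM definition: it assumes a Gaussian kernel for continuous attributes, an Aitchison-Aitken kernel for unordered categorical attributes, and the generic kernel-to-distance construction k(x,x) + k(y,y) - 2k(x,y) summed over attributes. The function names and the dictionary-based bandwidth arguments are hypothetical, and the bandwidths are supplied by hand rather than chosen by cross-validation.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel evaluated at the scaled difference u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def aitchison_aitken_kernel(a, b, lam, n_levels):
    """Aitchison-Aitken kernel for unordered categories.

    lam is assumed to lie in [0, (n_levels - 1) / n_levels].
    """
    return 1.0 - lam if a == b else lam / (n_levels - 1)


def kernel_dissimilarity(x, y, cont_idx, cat_idx, h, lam, n_levels):
    """Sum over attributes of k(x, x) + k(y, y) - 2 k(x, y).

    h, lam and n_levels are dicts keyed by attribute index (illustrative layout).
    """
    total = 0.0
    for j in cont_idx:
        u = (x[j] - y[j]) / h[j]
        # k(x, x) = k(y, y) = K(0) for a kernel of the scaled difference
        total += 2.0 * (gaussian_kernel(0.0) - gaussian_kernel(u))
    for j in cat_idx:
        same = aitchison_aitken_kernel(x[j], x[j], lam[j], n_levels[j])
        cross = aitchison_aitken_kernel(x[j], y[j], lam[j], n_levels[j])
        total += 2.0 * (same - cross)
    return total


# Hypothetical records: one continuous attribute and one 4-level categorical attribute
x = (1.0, "a")
y = (2.5, "b")
d_xy = kernel_dissimilarity(x, y, [0], [1], {0: 1.0}, {1: 0.1}, {1: 4})
```

In this sketch, setting a categorical bandwidth to its upper bound (0.75 for four levels) makes matching and non-matching categories equally weighted, so that attribute contributes nothing to the dissimilarity. That gives one way to picture how bandwidths can shrink a distance toward uniform dissimilarity.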

Paper Structure

This paper contains 23 sections, 28 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Variable distribution with respect to cluster assignment for four continuous simulated datasets. From left to right: Sim 1, Sim 2, Sim 3, Sim 4.
  • Figure 2: Variable distribution with respect to cluster assignment for Sim 5. $X_1$ and $X_2$ represent the binary noise variables, and $X_3$, $X_4$ and $X_5$ are the meaningful categorical variables, grouped in a 10 unit interval for ease of interpretation.
  • Figure 3: Boxplots of ARI and CA for KDSUM with hierarchical clustering and average-linkage, compared against competing clustering algorithms for simulated continuous, categorical, and mixed-type data.
  • Figure 4: The left panel is the data generating process of two continuous variables $X_1$ and $X_2$, with $k = 5$ clusters. The middle and right panels depict the CA and ARI, respectively, for the continuous bandwidth grid search, where increments of bandwidths were 0.05 in the range $[0,10]$ for both variables. The red dot on each panel indicates the optimal bandwidth selected via MSCV for the KDSUM metric.
  • Figure 5: The upper three panels are the data generating process with $k = 2$ clusters, with binary noise term ($X_1$), and two unordered categorical variables ($X_2, X_3$). The bottom plot is a parallel coordinates plot for the unordered categorical bandwidth grid search, where increments of bandwidths were 0.05 in the range $[0,0.75]$ for all variables, coloured by the ARI for each possible combination. The red lines indicate the optimal bandwidth determined using maximum-likelihood cross-validation with the KDSUM metric.
  • ...and 3 more figures
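
The figure captions report two evaluation scores, ARI and CA. As a rough illustration of how such scores can be computed, the sketch below implements the adjusted Rand index from pair counts and a clustering accuracy. It assumes CA means the fraction of points correctly assigned under the best one-to-one relabeling of predicted clusters; the captions do not define it, so that reading is an assumption. The function names are hypothetical, the brute-force relabeling is only practical for a small number of clusters, and at least two observations are assumed.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index computed from pair counts (assumes len(truth) >= 2)."""
    n = len(truth)
    cells = Counter(zip(truth, pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every point in one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def clustering_accuracy(truth, pred):
    """Best fraction of matches over one-to-one relabelings of predicted clusters."""
    true_labels = list(dict.fromkeys(truth))
    pred_labels = list(dict.fromkeys(pred))
    cells = Counter(zip(truth, pred))
    # Pad with placeholders so each predicted cluster can map to a distinct target
    placeholder = object()
    extra = max(0, len(pred_labels) - len(true_labels))
    targets = true_labels + [placeholder] * extra
    best = 0
    for perm in permutations(targets, len(pred_labels)):
        hits = sum(cells[(t, p)] for t, p in zip(perm, pred_labels))
        best = max(best, hits)
    return best / len(truth)


# Hypothetical labels: one point of the first true cluster is misassigned
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
ari = adjusted_rand_index(truth, pred)
ca = clustering_accuracy(truth, pred)
```

Both scores ignore the actual label values, so a prediction that merely swaps cluster names still scores 1.0 on each.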