Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes

Dylan Soemitro; Jeova Farias Sales Rocha Neto

TL;DR

This paper introduces SpecMix, a spectral clustering framework that integrates categorical information into clustering by augmenting a base numerical graph with extra nodes representing categories. The approach yields an interpretable NCut objective and, for purely categorical data, a linear-time algorithm via Transfer Cut (OnlyCat). Empirical results on synthetic and real data show that SpecMix often matches or outperforms state-of-the-art mixed-data methods while offering favorable runtime. The work suggests applicability to constrained clustering and to data types beyond tabular data, with avenues for automated parameter tuning and extension to other clustering criteria.

Abstract

Clustering data objects into homogeneous groups is one of the most important tasks in data mining. Spectral clustering is arguably one of the most important algorithms for clustering, as it is appealing for its theoretical soundness and is adaptable to many real-world data settings. For example, mixed data, where the data is composed of numerical and categorical features, is typically handled via numerical discretization, dummy coding, or similarity computation that takes into account both data types. This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm, avoiding the need for data preprocessing or the use of sophisticated similarity functions. We propose adding extra nodes corresponding to the different categories the data may belong to and show that this leads to an interpretable clustering objective function. Furthermore, we demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data. Finally, we compare our algorithms against other related methods and show that they provide a competitive alternative in terms of clustering quality and runtime.

Paper Structure (16 sections, 9 equations, 5 figures, 1 table)

Introduction
Related Work
Mixed-type Data Clustering
Categorical Data Clustering
Preliminaries
Notation
Spectral Clustering
Proposed Methodology
Graph Construction and Algorithm
Cut Interpretation
Case when dataset is categorical ($R = 0$)
Numerical Experiments
Algorithmic and Experimental Setup
Synthetic Data
Real Data
...and 1 more sections

Figures (5)

Figure 1: The proposed graph construction. Starting from a base graph computed solely using the available numerical variables, we add $t$ extra nodes corresponding to the categories of the available Categorical Variables (CV). Above, we depict how one of the nodes (the one corresponding to $p^{(7)}$) is connected to the extra nodes. In the above example, the point $p^{(7)}$ belongs to the categories represented by the nodes $F_{1, t_1}$, $F_{2, 2}$ and $F_{Q, 1}$.
Figure 2: SpecMix results on synthetic datasets when varying $\lambda$. In all experiments, $n=1000$ and $Q=3$, and each line is the average of 50 synthetic experiments. All heatmaps share the same $y$-axis. The overall average runtimes are: 0.18s ($\lambda=0$), 0.24s ($\lambda=10$), 0.23s ($\lambda=50$), 0.23s ($\lambda=100$), 0.22s ($\lambda=1000$), 0.08s (OnlyCat).
Figure 3: SpecMix results on synthetic data when varying the number of datapoints. In all experiments, $K=2$ and $Q=3$, and each line in the three left plots is the purity average of 50 synthetic experiments, while the one on the right is the average runtime of all experiments. All purity plots share the same $y$-axis.
Figure 4: Results on synthetic data for the tested methods. In all experiments, $n=1000$ and $Q=3$, and each line is the average of 50 synthetic experiments. All plots share the same $x$- and $y$-axes. The overall average runtimes are: 1.36s ($K$-prototypes), 0.16s (LCA), 1.06s (SpectralCAT), 0.11s (FAMD), 0.23s (SpecMix).
Figure 5: Results of synthetic categorical data for the tested methods. In all experiments, we set $Q = 5$. Each line is the average of 50 synthetic experiments. All plots share the same $x$-axis.
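
The graph construction described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names (`augmented_affinity`, `two_way_spectral_labels`), the dense Gaussian kernel used for the base graph, the bandwidth `sigma`, and the choice to give every point-to-category edge the weight `lam` (standing in for $\lambda$) are all assumptions made for illustration. The sketch also handles only the two-cluster case, using the sign of the second generalized eigenvector, and requires `lam > 0`.

```python
import numpy as np

def augmented_affinity(X_num, X_cat, lam=10.0, sigma=1.0):
    """Base numerical graph plus one extra node per (variable, category) pair.

    Assumptions (not taken from the paper): Gaussian-kernel base graph,
    and weight `lam` on every edge between a point and its category nodes.
    """
    if lam <= 0:
        raise ValueError("lam must be positive so every extra node has nonzero degree")
    n = X_num.shape[0]

    # Base graph on the numerical variables only.
    sq_dist = ((X_num[:, None, :] - X_num[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)

    # One-hot membership of each point in the categories of each variable.
    blocks = []
    for q in range(X_cat.shape[1]):
        _, codes = np.unique(X_cat[:, q], return_inverse=True)
        block = np.zeros((n, codes.max() + 1))
        block[np.arange(n), codes] = 1.0
        blocks.append(block)
    B = np.hstack(blocks)          # shape (n, t)
    t = B.shape[1]

    # Augmented (n + t) x (n + t) affinity; extra nodes are not linked to each other.
    A = np.zeros((n + t, n + t))
    A[:n, :n] = W
    A[:n, n:] = lam * B
    A[n:, :n] = lam * B.T
    return A, t

def two_way_spectral_labels(A, n):
    """Two-way normalized-cut relaxation; returns labels for the first n nodes."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]      # second generalized eigenvector
    return (fiedler[:n] > 0).astype(int)

# Toy example: two numerical groups whose categories agree with the grouping.
X_num = np.array([[0.0], [0.2], [0.4], [0.1], [0.3],
                  [3.0], [3.2], [3.4], [3.1], [3.3]])
X_cat = np.array([["a", "x"]] * 5 + [["b", "y"]] * 5)

A, t = augmented_affinity(X_num, X_cat, lam=10.0, sigma=1.0)
labels = two_way_spectral_labels(A, n=X_num.shape[0])
print(A.shape, t, labels)
```

Only the first `n` entries of the eigenvector are turned into labels, since the extra nodes exist to carry categorical information and are not themselves data points. For more than two clusters, one would typically run $k$-means on several leading eigenvectors instead of thresholding a single one.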
