Fast Clustering of Categorical Big Data

Bipana Thapaliya, Yu Zhuang

TL;DR

This work addresses the initialization sensitivity of $K$-Modes for clustering categorical data by introducing Bisecting K-Modes (BK-Modes). BK-Modes repeatedly bisects clusters using the Two-Modes algorithm, each time splitting the cluster with the largest sum of distances, until it has $K$ clusters; their centers then initialize $K$-Modes. The method emphasizes scalability to large datasets and achieves better clustering quality (lower sum of distances, $SD$) and efficiency (fewer iterations and lower runtime) than random initialization and existing density/distance-based initializations. Experimental results on three large real-world datasets show that BK-Modes provides a reliable, parameter-light initialization that leads to fast convergence and high-quality clustering in big-data settings. The approach is of practical use to practitioners who need robust and scalable initialization for categorical clustering tasks.
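For reference, the quality measure can be written out explicitly. This is a sketch that assumes $SD$ is the sum of distances named above and that distances are the simple matching dissimilarity commonly used with $K$-Modes. The symbols $m$ (number of attributes), $C_k$ (the $k$-th cluster), and $c_k$ (its mode) are notation introduced here for illustration, not taken from the paper.

$$
d(x, y) = \sum_{j=1}^{m} \mathbb{1}[x_j \neq y_j], \qquad SD = \sum_{k=1}^{K} \sum_{x \in C_k} d(x, c_k)
$$

Under this reading, a lower $SD$ means data points agree with their cluster modes on more attributes.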

Abstract

The K-Modes algorithm, developed for clustering categorical data, is algorithmically simple, but its clustering quality and efficiency are unreliable because both depend heavily on the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process for finding clusters, and examine how well the cluster centers it produces serve as initial centers for K-Modes. BK-Modes splits a dataset into multiple clusters iteratively: in each iteration, one cluster is chosen and bisected into two. The cluster to bisect is selected using the sum of distances of data points to their cluster centers. The process stops when K clusters have been produced, and the centers of these K clusters are then used as the initial cluster centers for K-Modes. We compared BK-Modes experimentally against K-Modes run with multiple sets of initial cluster centers and against the best existing method we found in our survey. The results indicate that BK-Modes performs well in both clustering quality and efficiency on large datasets.
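The bisecting process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: distances are taken to be the simple matching (Hamming) dissimilarity, the cluster with the largest sum of distances is the one bisected, and the seeding of each two-way split (the row farthest from the cluster mode, then the row farthest from that one) is a choice made here because the abstract does not specify it. All function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(rows):
    """Attribute-wise most frequent value, i.e. the cluster mode."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def sum_of_distances(rows, center):
    """Sum of distances from every row to the given center."""
    return sum(hamming(r, center) for r in rows)


def k_modes(rows, centers, max_iter=100):
    """Plain K-Modes from the given initial centers; returns (centers, labels)."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each row to its nearest center (ties go to the lower index).
        new_labels = [
            min(range(len(centers)), key=lambda j: hamming(r, centers[j]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Recompute each center as the mode of its members.
        for j in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = mode_of(members)
    return centers, labels


def bisect(rows):
    """Two-way split via K-Modes with K = 2. The seeding is an assumption."""
    m = mode_of(rows)
    a = max(rows, key=lambda r: hamming(r, m))  # farthest from the mode
    b = max(rows, key=lambda r: hamming(r, a))  # farthest from the first seed
    _, labels = k_modes(rows, [a, b])
    left = [r for r, lab in zip(rows, labels) if lab == 0]
    right = [r for r, lab in zip(rows, labels) if lab == 1]
    if not left or not right:
        raise ValueError("cluster cannot be split further")
    return left, right


def bk_modes_init(rows, k):
    """Bisect until k clusters exist; return their modes as initial centers."""
    clusters = [list(rows)]
    while len(clusters) < k:
        # Selection metric: the cluster with the largest sum of distances.
        idx = max(
            range(len(clusters)),
            key=lambda i: sum_of_distances(clusters[i], mode_of(clusters[i])),
        )
        clusters.extend(bisect(clusters.pop(idx)))
    return [mode_of(c) for c in clusters]


if __name__ == "__main__":
    data = [
        ("red", "s", "y"), ("red", "s", "n"), ("red", "s", "y"),
        ("blue", "l", "n"), ("blue", "l", "y"), ("blue", "l", "n"),
    ]
    init = bk_modes_init(data, 2)
    centers, labels = k_modes(data, init)
    print(init, labels)
```

The sketch favors readability over speed: it recomputes each cluster's mode and sum of distances on every selection step, which a big-data implementation would cache or vectorize.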

Paper Structure

This paper contains 21 sections, 13 equations, 3 figures, 3 tables, 9 algorithms.

Figures (3)

  • Figure 1: Results of three methods on Dataset 1.
  • Figure 2: Results of three methods on Dataset 2.
  • Figure 3: Results of three methods on Dataset 3.