Hierarchical Clustering using Reversible Binary Cellular Automata for High-Dimensional Data

Baby C. J., Kamalika Bhattacharjee

TL;DR

The paper tackles high-dimensional clustering by harnessing the cyclic structure of reversible binary cellular automata (CA). It introduces a three-stage hierarchical CA-based scheme: Stage 1 encodes data with frequency-based binary representations and performs initial clustering on vertical splits using a chosen CA rule; Stage 2 uses median-based sorting and Gray-code indexing to merge and compress clusters while applying a second rule, with recursion as needed; Stage 3 finalizes clustering via median-gap analysis to obtain exactly $k_c$ clusters. A principled rule selection process based on information propagation and cycle structure yields a compact set of effective CA rules, enabling quadratic-time performance under practical conditions. Empirical results on standard datasets show clustering quality competitive with hierarchical, k-means, and BIRCH methods, and the method scales to high-dimensional data and applies across diverse domains. The work also provides a GitHub implementation, highlighting the practical potential of CA-based hierarchical clustering in real-world tasks.
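The median-gap idea in Stage 3 can be illustrated with a minimal sketch. It assumes that each cycle is a list of integers (for example, the decimal values of encoded configurations), that cycles are ordered by their medians, and that the final clusters come from cutting at the k - 1 largest gaps between consecutive medians. The function name `median_gap_clusters`, the tie-breaking rule, and the sample data are hypothetical and are not taken from the paper or its implementation.

```python
from statistics import median


def median_gap_clusters(cycles, k):
    """Merge cycles into k clusters by cutting at the k-1 largest median gaps.

    Assumption: cycles are compared through the median of their elements.
    """
    if not 1 <= k <= len(cycles):
        raise ValueError("k must be between 1 and the number of cycles")

    # Order the cycles by their medians.
    order = sorted(range(len(cycles)), key=lambda i: median(cycles[i]))
    meds = [median(cycles[i]) for i in order]

    # Gap j sits between the sorted cycles at positions j and j + 1.
    gaps = [(meds[j + 1] - meds[j], j) for j in range(len(meds) - 1)]

    # Keep the k-1 largest gaps; ties go to the earlier position (illustrative choice).
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[: k - 1]
    cuts = {j for _, j in largest}

    clusters, current = [], []
    for pos, idx in enumerate(order):
        current.extend(cycles[idx])
        if pos in cuts:  # close the current cluster after this cycle
            clusters.append(current)
            current = []
    clusters.append(current)
    return clusters


# Hypothetical cycles; medians are 2, 4.5, 21, 24, 61.
sample_cycles = [[1, 2, 3], [4, 5], [20, 22], [24], [60, 61, 63]]
result = median_gap_clusters(sample_cycles, 3)
```

Here the two largest gaps lie between medians 4.5 and 21 and between 24 and 61, so `result` holds three clusters. Cutting only at the largest gaps always yields exactly the requested number of clusters, which is why such a step suits a final stage that must hit a user-specified count.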

Abstract

This work proposes a hierarchical clustering algorithm for high-dimensional datasets using the cyclic space of reversible finite cellular automata. In cellular automaton (CA) based clustering, two objects that belong to the same cycle are considered closely related and are placed in the same cluster. However, if a high-dimensional dataset is clustered using the cycles of a single CA, closely related objects may fall into different cycles. This paper identifies the relationship between objects in two different cycles based on the median of all elements in each cycle, so that they can be grouped in the next stage. Further, to minimize the number of intermediate clusters, which in turn reduces the computational cost, a rule selection strategy is adopted to find the best rules based on information propagation and cycle structure. After the dataset is encoded using a frequency-based encoding such that consecutive data elements maintain a minimum Hamming distance in encoded form, our proposed clustering algorithm iterates over three stages to cluster the data elements into the number of clusters specified by the user. The algorithm can be applied in various fields, including healthcare, sports, chemical research, and agriculture. When verified on standard benchmark datasets with various performance metrics, our algorithm performs on par with existing algorithms while running in quadratic time.
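The premise that objects in the same cycle share a cluster can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a 3-neighborhood elementary CA with periodic boundary and rule 170 (a plain shift, which is reversible on a ring), chosen only because its reversibility is easy to verify. It is not one of the rules selected in the paper, and the helper names `eca_step`, `cycle_decomposition`, and `cluster_labels` are hypothetical.

```python
def eca_step(config, rule, n):
    """One synchronous step of an elementary CA on n cells, periodic boundary.

    The configuration is an integer; bit i is cell i, with its left
    neighbor at bit i + 1 and its right neighbor at bit i - 1 (mod n).
    """
    nxt = 0
    for i in range(n):
        left = (config >> ((i + 1) % n)) & 1
        center = (config >> i) & 1
        right = (config >> ((i - 1) % n)) & 1
        idx = (left << 2) | (center << 1) | right  # Wolfram rule-table index
        nxt |= ((rule >> idx) & 1) << i
    return nxt


def cycle_decomposition(rule, n):
    """Partition all 2**n configurations into cycles; fail if not reversible."""
    seen, cycles = set(), []
    for start in range(2 ** n):
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = eca_step(x, rule, n)
        if x != start:  # the trajectory merged into another one
            raise ValueError("rule is not reversible for this lattice size")
        cycles.append(cyc)
    return cycles


def cluster_labels(objects, rule, n):
    """Label each encoded object with the index of the cycle containing it."""
    label_of = {}
    for label, cyc in enumerate(cycle_decomposition(rule, n)):
        for state in cyc:
            label_of[state] = label
    return [label_of[x] for x in objects]


# Rule 170 on 4 cells: configurations that are rotations of each other share a cycle.
labels = cluster_labels([0b0001, 0b0100, 0b0011, 0b1111], rule=170, n=4)
```

With these settings the 16 configurations split into 6 cycles, and `0b0001` and `0b0100` receive the same label because one is a rotation of the other. Enumerating all 2**n configurations is feasible only for small n, which is consistent with the summary's description of clustering on vertical splits of the encoded data rather than on the full encoding at once.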

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 11 tables, 3 algorithms.

Figures (6)

  • Figure 1: Evolution of a $5$-cell reversible CA $267422991$
  • Figure 2: Vertical data split
  • Figure 3: Stage 1 - Initial clustering by applying rule $R_1$ to each vertical split.
  • Figure 4: Stage 2 - Clustering using the median of each cycle and applying rule $R_2$
  • Figure 5: Example for Stage 3: Three clusters are created based on two maximum median gaps at the cycle index at 2 and 4
  • ...and 1 more figure

Theorems & Definitions (7)

  • Definition 1
  • Definition 2
  • Definition 3
  • Example 1
  • Example 2
  • Example 3
  • Example 4