Village-Net Clustering: A Rapid approach to Non-linear Unsupervised Clustering of High-Dimensional Data

Aditya Ballal, Esha Datta, Gregory A. DePaul, Erik Carlsson, Ye Chen-Izu, Javier E. López, Leighton T. Izu

TL;DR

Village-Net clustering introduces a two-phase, scalable approach for high-dimensional unsupervised clustering without a predefined cluster count. It first partitions the data into K-Means-generated villages and then applies the Walk-Likelihood Community Finder (WLCF) to a weighted village graph, which lets it infer the number of clusters autonomously while keeping the time complexity at roughly $O(N k d)$. Empirical results on eight real-world datasets show competitive or superior clustering quality (as measured by NMI/ARI) with significantly faster runtimes than several baselines, particularly for large or complex datasets. The method offers a practical, density-oriented strategy for non-linear clustering in high dimensions, with future work aimed at iteratively refining village representations to further improve accuracy.

Abstract

Clustering large high-dimensional datasets with diverse variables is essential for extracting high-level latent information from these datasets. Here, we developed an unsupervised clustering algorithm that we call "Village-Net". Village-Net is specifically designed to cluster high-dimensional data effectively without a priori knowledge of the number of existing clusters. The algorithm operates in two phases. First, utilizing K-Means clustering, it divides the dataset into distinct subsets we refer to as "villages". Next, a weighted network is created in which each node represents a village and the weights capture the proximity relationships between villages. To achieve optimal clustering, we process this network using the Walk-likelihood Community Finder (WLCF), a community detection algorithm developed by one of our team members. A salient feature of Village-Net Clustering is its ability to autonomously determine an optimal number of clusters for further analysis based on inherent characteristics of the data. We present extensive benchmarking on extant real-world datasets with known ground-truth labels to showcase its competitive performance, particularly in terms of the normalized mutual information (NMI) score, when compared to other state-of-the-art methods. The algorithm is computationally efficient, with a time complexity of $O(N k d)$, where $N$ is the number of instances, $k$ is the number of villages, and $d$ is the dimension of the dataset, which makes it well suited to handling large-scale datasets.
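
The two-phase structure described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the farthest-first initialisation, the edge-weight rule (a count of points whose two nearest village centroids lie at comparable distance, controlled by a hypothetical `border_ratio`), and the use of plain connected components as a crude stand-in for WLCF. All function names are hypothetical.

```python
import math


def nearest_two(point, centroids):
    """Return (nearest index, its distance, second-nearest index, its distance)."""
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, a), (d2, b) = ranked[0], ranked[1]
    return a, d1, b, d2


def farthest_first(points, k):
    """Deterministic initialisation (an assumption, not taken from the paper)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose closest already-chosen centroid is farthest away.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iters=25):
    """Phase 1: plain Lloyd's K-Means; each resulting cluster is a 'village'."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        labels = [nearest_two(p, centroids)[0] for p in points]
        for v in range(k):
            members = [p for p, lab in zip(points, labels) if lab == v]
            if members:  # an empty village keeps its previous centroid
                centroids[v] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [nearest_two(p, centroids)[0] for p in points]
    return labels, centroids


def village_graph(points, centroids, border_ratio=2.0):
    """Phase 2a: weighted village graph (illustrative weight rule, not the paper's).

    The weight of edge (a, b) counts points whose two nearest centroids are
    a and b and whose second-nearest distance is at most border_ratio times
    the nearest distance.
    """
    weights = {}
    for p in points:
        a, d1, b, d2 = nearest_two(p, centroids)
        if d2 <= border_ratio * d1:
            edge = (min(a, b), max(a, b))
            weights[edge] = weights.get(edge, 0) + 1
    return weights


def connected_components(n_nodes, weights, min_weight=1):
    """Phase 2b: connected components via union-find, standing in for WLCF."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w >= min_weight:
            parent[find(a)] = find(b)
    return [find(v) for v in range(n_nodes)]


def village_net_sketch(points, k):
    """Map every point to the community of the village it belongs to."""
    labels, centroids = kmeans(points, k)
    community = connected_components(k, village_graph(points, centroids))
    return [community[v] for v in labels]


# Two well-separated 10 x 10 grids of points, split into k = 6 villages.
blob_a = [(0.2 * i, 0.2 * j) for i in range(10) for j in range(10)]
blob_b = [(x + 10.0, y + 10.0) for x, y in blob_a]
clusters = village_net_sketch(blob_a + blob_b, k=6)
print(len(set(clusters)), "clusters found from 6 villages")
```

The number of villages `k` is deliberately larger than the expected number of clusters: the final cluster count is never passed in and instead emerges from how the village graph groups. Assigning each of the $N$ points to the nearest of $k$ centroids in $d$ dimensions is also the step that gives K-Means its $O(N k d)$ cost per iteration.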

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Sequential Stages of Village-Net Clustering
  • Figure 2: Visualization of Village-Net Clustering on Two-Moons dataset
  • Figure 3: Wall Time Analysis of Village-Net Clustering on Various Implementations of the Two-Moons Dataset
  • Figure 4: Comparison of clusters obtained by Village-Net Clustering on different hyperparameters with the ground truth on Digits dataset