
Faithful Density-Peaks Clustering via Matrix Computations on MPI Parallelization System

Ji Xu, Tianlong Xiao, Jinye Yang, Panpan Zhu

TL;DR

This paper tackles the scalability of Density Peaks Clustering (DP) for large and non-Euclidean data. It introduces FaithPDP, an MPI-based parallel DP that uses matrix-based distance computations and an inverse leading-node policy to preserve exact DP results while achieving near-linear time and linear memory. It demonstrates through extensive experiments that FaithPDP outperforms state-of-the-art scalable DP approaches in accuracy and maintains competitive efficiency, with applicability to non-Euclidean graphs. The work broadens DP applicability to big data and non-Euclidean domains, and the code is publicly available at its GitHub repository.

Abstract

Density peaks clustering (DP) can detect clusters of arbitrary shape and cluster data in non-Euclidean spaces, but its quadratic complexity in both computation and storage makes it difficult to scale to big data. Various approaches have been proposed to address this, including MapReduce-based distributed computing, multi-core parallelism, presentation transformation (e.g., kd-tree, Z-value), and granular computing. However, most of these methods face two limitations. First, their target datasets are mostly constrained to Euclidean space. Second, they emphasize only local neighbors while ignoring the global data distribution, because they are restricted to a cut-off kernel when computing density. To address these two issues, we present a faithful and parallel DP method that makes use of two types of vector-like distance matrices and an inverse leading-node-finding policy. The method is implemented on a message passing interface (MPI) system. Extensive experiments show that our method is capable of clustering non-Euclidean data, as in community detection, while outperforming state-of-the-art counterpart methods in accuracy when clustering large Euclidean data. Our code is publicly available at https://github.com/alanxuji/FaithPDP.
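
To make the quadratic cost concrete, the following is a minimal sketch of plain (non-parallel) DP on a precomputed distance matrix, not the FaithPDP implementation. It computes the three key vectors: density ρ, the distance δ to the nearest denser point, and the index μ of that point (the leading node). The function name `density_peaks_vectors`, the Gaussian kernel, the bandwidth `dc`, and the convention of assigning the global peak its maximum distance are assumptions made for illustration. Because only distances are used, the input does not have to come from a Euclidean space.

```python
import math


def density_peaks_vectors(dist, dc):
    """Return (rho, delta, mu) for a precomputed symmetric distance matrix.

    dist is a list of lists; dc is an assumed kernel bandwidth.
    """
    n = len(dist)
    # Gaussian-kernel density: every point contributes, not only those
    # within dc, so the global distribution is reflected (O(n^2) work).
    rho = [
        sum(math.exp(-((dist[i][j] / dc) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = [0.0] * n
    mu = [-1] * n  # -1 marks a point with no denser point (a density peak)
    for i in range(n):
        denser = [j for j in range(n) if rho[j] > rho[i]]
        if denser:
            # Leading node: the nearest point with strictly higher density.
            mu[i] = min(denser, key=lambda j: dist[i][j])
            delta[i] = dist[i][mu[i]]
        else:
            # Assumed convention: a peak gets its largest distance.
            delta[i] = max(dist[i])
    return rho, delta, mu


# Five points on a line; absolute difference serves as the distance.
points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
rho, delta, mu = density_peaks_vectors(dist, dc=2.0)
```

Both the density pass and the leading-node search touch every pair of points, which is the quadratic cost in time and storage that the abstract refers to.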

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, 2 tables, and 2 algorithms.

Figures (3)

  • Figure 1: The steps of computing the key vectors in FaithPDP. (a) Compute the distances from a subset of the samples to all data using a matrix formulation, so that the computations of the tall distance matrix ${D}^{\mathcal{K}}$ and $\boldsymbol \rho$ are parallelizable. (b) The segment of the density vector and the block of the tall distance matrix are computed from a wide distance matrix. (c) Most of the depending data points and depending distances are computed via the inverse density-distance condition; data points that cannot find a depending data point are identified as mini centers for further processing (parallelizable). (d) Compute the wide distance matrix for the mini centers to decide the remaining $\mu$ and $\delta$ (centralized). At this point, all three key vectors for DP have been computed.
  • Figure 2: Running time comparisons.
  • Figure 3: The empirical evaluations of six comparative methods and our proposed method on the 5spiral50K dataset (generated by "FiveSpiralData.py").
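
Steps (a) and (b) of Figure 1 can be sketched roughly as follows, assuming Euclidean data and a Gaussian density kernel. This is an illustration only: the function names, the bandwidth `dc`, and the row-splitting scheme are hypothetical, a plain loop over row blocks stands in for MPI ranks, and the inverse leading-node step (c) and the mini-center step (d) are not reproduced here.

```python
import numpy as np


def wide_distance_block(X_block, X_all):
    """Euclidean distances from each row of X_block to every row of X_all.

    Uses the matrix identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y,
    so the whole block is produced by one matrix product.
    """
    sq_block = np.sum(X_block ** 2, axis=1)[:, None]
    sq_all = np.sum(X_all ** 2, axis=1)[None, :]
    sq_dist = sq_block + sq_all - 2.0 * X_block @ X_all.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(sq_dist, 0.0))


def density_segment(D_wide, dc):
    """Gaussian-kernel density for the rows of one wide block.

    Each row contains its own zero self-distance, which contributes
    exp(0) = 1 and is subtracted out.
    """
    return np.exp(-((D_wide / dc) ** 2)).sum(axis=1) - 1.0


def blockwise_density(X, n_blocks, dc):
    """Assemble the density vector from independent row blocks.

    Each loop iteration needs only its own rows plus the full data set,
    so in a distributed setting each block could go to a separate worker.
    """
    segments = []
    for rows in np.array_split(np.arange(len(X)), n_blocks):
        D_wide = wide_distance_block(X[rows], X)
        segments.append(density_segment(D_wide, dc))
    return np.concatenate(segments)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
rho = blockwise_density(X, n_blocks=4, dc=1.0)
```

In this sketch no single step holds the full square distance matrix: each block is a wide matrix with as many columns as there are samples but only a fraction of the rows.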