ACTGNN: Assessment of Clustering Tendency with Synthetically-Trained Graph Neural Networks

Yiran Luo, Evangelos E. Papalexakis

TL;DR

ACTGNN tackles the challenge of assessing clustering tendency in noisy, high-dimensional data by learning from synthetic datasets. It builds graphs with LSH-based node features and edge features from multiple similarity measures, and uses a 5-layer Graph Convolutional Network to classify whether a dataset contains a k-means clustering structure. Across extensive synthetic and real-world experiments, ACTGNN outperforms Hopkins Statistic- and Silhouette-based baselines, demonstrating strong robustness to dimensionality and noise and clear generalization from synthetic to real data. This work provides a scalable, automated alternative to traditional methods, with potential to improve clustering tendency assessment in diverse data analysis pipelines.

Abstract

Determining clustering tendency in datasets is a fundamental but challenging task, especially in noisy or high-dimensional settings where traditional methods, such as the Hopkins Statistic and Visual Assessment of Tendency (VAT), often struggle to produce reliable results. In this paper, we propose ACTGNN, a graph-based framework designed to assess clustering tendency by leveraging graph representations of data. Node features are constructed using Locality-Sensitive Hashing (LSH), which captures local neighborhood information, while edge features incorporate multiple similarity metrics, such as the Radial Basis Function (RBF) kernel, to model pairwise relationships. A Graph Neural Network (GNN) is trained exclusively on synthetic datasets, enabling robust learning of clustering structures under controlled conditions. Extensive experiments demonstrate that ACTGNN significantly outperforms baseline methods on both synthetic and real-world datasets, exhibiting superior performance in detecting faint clustering structures, even in high-dimensional or noisy data. Our results highlight the generalizability and effectiveness of the proposed approach, making it a promising tool for robust clustering tendency assessment.
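The node-feature step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which LSH family is used, so the code below assumes random-hyperplane (sign) hashing, and the function name `lsh_node_features` and the parameter `n_bits` are hypothetical. Each point receives a binary signature, and points that lie close together tend to share most of their bits.

```python
import numpy as np

def lsh_node_features(X, n_bits=16, seed=0):
    """Binary LSH signatures via random hyperplanes (an assumed LSH family)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Center the data so the hyperplanes pass through the data mean.
    Xc = X - X.mean(axis=0)
    # One random hyperplane per output bit.
    planes = rng.standard_normal((X.shape[1], n_bits))
    # A bit records which side of its hyperplane the point falls on.
    return (Xc @ planes > 0).astype(np.float32)

# Toy data: two well-separated Gaussian blobs in 8 dimensions.
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(-5.0, 0.5, size=(50, 8)),
    data_rng.normal(5.0, 0.5, size=(50, 8)),
])
F = lsh_node_features(X)

# Mean Hamming distance between signatures, within and across blobs.
within = np.abs(F[:50, None, :] - F[None, :50, :]).sum(axis=-1).mean()
across = np.abs(F[:50, None, :] - F[None, 50:, :]).sum(axis=-1).mean()
```

On data like this, signatures within a blob should differ in far fewer bits than signatures across blobs, which is the local neighborhood information the node features are meant to carry.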

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the proposed ACTGNN framework for clustering tendency assessment. The process includes transforming raw data into a graph representation by constructing node and edge features, followed by binary classification using a graph neural network.
  • Figure 2: Performance comparison of ACTGNN, the Hopkins Statistic, and K-means with Silhouette Score on synthetic datasets of different dimensions. The horizontal red dashed line marks ACTGNN's performance.
  • Figure 3: Performance comparison of ACTGNN, Hopkins Statistic, and K-means with Silhouette Score under two experimental variants using the MNIST dataset. The first row in each figure shows the raw scores for the two baseline methods, while the second row presents binary predictions as the percentage of structured data increases.
  • Figure 4: Heatmap of testing accuracy for different edge feature strategies and percentages of nearest neighbors connected. The RBF kernel with moderate $\sigma$ values (2 or 5) and 50%--60% neighbor connections achieves the highest accuracy.
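
The edge-feature setting explored in Figure 4 (an RBF kernel with bandwidth $\sigma$, with each node connected to a percentage of its nearest neighbors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the kernel form $\exp(-d^2 / (2\sigma^2))$, the directed neighbor edges, and the names `rbf_knn_edges` and `frac` are all assumptions made for this example.

```python
import numpy as np

def rbf_knn_edges(X, sigma=2.0, frac=0.5):
    """Connect each node to its nearest `frac` of other nodes, weighted by an RBF kernel."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances, clipped at zero for numerical safety.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Infinite self-distance pushes each node to the end of its own ranking.
    np.fill_diagonal(d2, np.inf)
    # Number of neighbors per node, kept between 1 and n - 1.
    k = min(n - 1, max(1, int(round(frac * (n - 1)))))
    nbrs = np.argsort(d2, axis=1)[:, :k]
    src = np.repeat(np.arange(n), k)
    dst = nbrs.ravel()
    # Assumed kernel form: exp(-d^2 / (2 * sigma^2)).
    weights = np.exp(-d2[src, dst] / (2.0 * sigma ** 2))
    edge_index = np.stack([src, dst])  # shape (2, n * k), directed edges
    return edge_index, weights

# Toy usage: 11 points in 4 dimensions, each linked to 50% of the other points.
pts = np.random.default_rng(0).normal(size=(11, 4))
edge_index, weights = rbf_knn_edges(pts, sigma=2.0, frac=0.5)
```

The `(2, num_edges)` layout of `edge_index` matches the edge-list convention common in GNN libraries, so the output could be passed on with little reshaping. A smaller `sigma` makes the weights decay faster with distance, and `frac` controls how dense the graph is.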