Spectral Toolkit of Algorithms for Graphs: Technical Report (2)

Peter Macgregor, He Sun

TL;DR

STAG 2.0 extends the open-source graph-analysis toolkit with three scalable components: Euclidean Locality Sensitive Hashing for approximate nearest neighbors, CKNS-based Gaussian Kernel Density Estimation for fast density queries, and an MS-based fast spectral clustering pipeline. The report provides a comprehensive user guide, practical API descriptions, and demonstrations that highlight how CKNS KDE enables efficient similarity graph construction and scalable clustering on large datasets. It discusses design decisions, parameter choices, and performance comparisons, illustrating STAG's applicability to large-scale graph-based data analysis in both C++ and Python. The integrated approach reduces the computational burden of traditional fully connected graphs while preserving clustering structure and providing theoretical guarantees where applicable.

Abstract

Spectral Toolkit of Algorithms for Graphs (STAG) is an open-source library for efficient graph algorithms. This technical report presents the newly implemented components for locality sensitive hashing, kernel density estimation, and fast spectral clustering. The report includes a user's guide to the newly implemented algorithms, experiments and demonstrations of the new functionality, and several technical considerations behind our development.

Paper Structure (35 sections, 15 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Demonstration of the basic unit of Euclidean LSH: projection onto a random vector. (a) Projecting onto a random vector, with discretisation into hash buckets. (b) Projecting onto multiple random vectors further divides the space. Closer points are more likely to fall into the same hash bucket.
  • Figure 2: The collision probability of two points under a hash function drawn uniformly at random from the hash family $\mathcal{F}$. Figure (a) shows the collision probability with respect to the distance between the two input vectors $u$ and $v$; Figure (b) shows the value of $1 - \left(1 - p(c)^K\right)^L$, which is the collision probability when applying $K \cdot L$ independent hash functions.
  • Figure 3: Three kernel functions which can be used for kernel density estimation.
  • Figure 4: Kernel density estimation provides an estimate of the probability distribution from which the data is drawn. Figure (a) shows the underlying probability distribution; Figure (b) shows the generated data points based on the probability distribution from (a); Figure (c) shows the empirical kernel density estimate of this underlying distribution.
  • Figure 5: The CKNS algorithm first generates several samples of the data with probabilities $1/2, 1/4, \ldots, 1/n$. Then, for any query point $q$ (indicated as the red point shown above), the data points in $\mathcal{L}_i$ are recovered from the data sampled with probability $2^{-i}$.
  • ...and 3 more figures
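
The captions for Figures 1 and 2 describe hashing by projection onto a random vector, and the amplified collision probability $1 - \left(1 - p(c)^K\right)^L$. The following is a minimal Python sketch of both ideas, not STAG's API. It assumes the standard construction $h(x) = \lfloor (\langle a, x \rangle + b) / w \rfloor$ with a Gaussian vector $a$ and a uniform offset $b$, which the captions do not spell out. The names `make_hash`, `estimate_collision_rate`, `amplified_collision_probability`, and `bucket_width` are hypothetical.

```python
import math
import random


def make_hash(dim, bucket_width, rng):
    """Build one hash h(x) = floor((<a, x> + b) / w).

    Assumption: a has i.i.d. standard Gaussian entries and b is uniform
    in [0, w). This is the usual Euclidean LSH construction, used here
    for illustration only.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, bucket_width)

    def h(x):
        # Project onto the random vector, shift, then discretise into buckets.
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / bucket_width)

    return h


def estimate_collision_rate(u, v, bucket_width, trials, seed=0):
    """Empirical fraction of random hash functions under which u and v collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_hash(len(u), bucket_width, rng)
        if h(u) == h(v):
            hits += 1
    return hits / trials


def amplified_collision_probability(p, K, L):
    """Evaluate 1 - (1 - p^K)^L for a single-hash collision probability p.

    K hashes are concatenated per table (all must agree), and L tables
    are used (at least one must agree).
    """
    return 1.0 - (1.0 - p ** K) ** L


if __name__ == "__main__":
    near = estimate_collision_rate([0.0, 0.0], [0.1, 0.0], 4.0, 2000)
    far = estimate_collision_rate([0.0, 0.0], [10.0, 0.0], 4.0, 2000)
    print("near:", near, "far:", far)
    print("amplified:", amplified_collision_probability(near, K=6, L=10))
```

Increasing $K$ makes a collision in any single table less likely for every pair of points, while increasing $L$ gives each pair more chances to collide. Together they sharpen the gap between near and far pairs, as in Figure 2(b).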