
A Doubled Adjacency Spectral Embedding Approach to Graph Clustering

Sinyoung Park, Matthew Nunes, Sandipan Roy

Abstract

Spectral clustering is a popular tool in network data analysis, with applications across a variety of scientific areas. However, many studies have shown that classical spectral clustering does not perform well on certain network structures, particularly core-periphery networks. To improve clustering performance in core-periphery structures, Adjacency Spectral Embedding (ASE) has been introduced, which performs clustering via a network's adjacency matrix instead of the graph Laplacian. Despite its advantages in this setting, the optimal performance of ASE is limited to dense networks, whilst network data observed in practice are often sparse. To address this limitation, we propose a new approach which we term Doubled Adjacency Spectral Embedding (DASE), motivated by the observation that the squared adjacency matrix leverages the fewer connections in sparse structures more efficiently in clustering applications. Theoretical results establish that the resulting clustering algorithm enjoys good consistency properties when determining sparse community structure. The performance and general applicability of the proposed method are evaluated using extensive simulations on both directed and undirected networks. Our results highlight the improved clustering performance on both sparse and dense networks in the presence of core-periphery structures. We illustrate our proposed technique on real-world employment and transportation datasets.
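
To make the idea in the abstract concrete, the following is a minimal sketch of clustering via a squared adjacency matrix. It is not the authors' algorithm: it assumes the doubled matrix is the plain matrix square `A @ A` (the paper's Definition 4 may differ), it concatenates scaled left and right singular vectors as the embedding, and it uses a small hand-written $k$-means. The function names, the block probability matrix `B`, and the network size are all hypothetical choices for illustration.

```python
import numpy as np

def sample_sbm(block_probs, labels, rng):
    """Sample a directed binary adjacency matrix from a stochastic block model."""
    P = block_probs[np.ix_(labels, labels)]      # edge probability for each ordered pair
    A = (rng.random(P.shape) < P).astype(float)
    np.fill_diagonal(A, 0.0)                     # no self-loops
    return A

def squared_adjacency_embedding(A, d):
    """Embed nodes with the top-d singular triplets of A @ A (assumed form)."""
    U, s, Vt = np.linalg.svd(A @ A)
    scale = np.sqrt(s[:d])
    # Concatenate left and right embeddings so that both edge directions are used.
    return np.hstack([U[:, :d] * scale, Vt[:d].T * scale])

def kmeans(X, k, rng, n_init=10, n_iter=100):
    """Plain Lloyd's algorithm with random restarts; returns the best labelling."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        cost = dists.min(axis=1).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

rng = np.random.default_rng(0)
N = 200                                          # hypothetical network size
truth = np.repeat([0, 1], N // 2)                # block 0 = core, block 1 = periphery
B = np.array([[0.5, 0.2], [0.2, 0.05]])          # hypothetical core-periphery probabilities
A = sample_sbm(B, truth, rng)
X = squared_adjacency_embedding(A, d=2)
pred = kmeans(X, 2, rng)
# Cluster labels are arbitrary, so score agreement up to swapping the two labels.
agreement = max(np.mean(pred == truth), np.mean(pred != truth))
print(f"agreement with true blocks: {agreement:.3f}")
```

The full SVD is used here only for brevity; on large sparse graphs a truncated solver would be the more natural choice.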

Paper Structure

This paper contains 33 sections, 12 theorems, 101 equations, 14 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Assume that the graph under consideration is directed, and that Assumptions assum:1, assum:2 and assum:3 in Section sec:notations hold. Suppose also that the number of blocks $K$ and the latent vector dimension $d$ are known. Let $\hat{\theta}^{(N)}: \mathcal{V} \mapsto \{ 1, \dots, K \}$ be the est

Figures (14)

  • Figure 1: Block connection probability structure for core-periphery graphs.
  • Figure 2: Comparison of clustering performance on directed graphs using $k$-means in terms of mean accuracy (NMI) over $n_{rep}=50$ simulated graphs when $K=2$ with $\pi = (0.5, 0.5)^\top$: (a) NMI with fixed network size ($N=1,000$) and varying network density; (b) NMI with fixed expected network density ($\alpha = 0.05$) and varying network sizes. In (a) and (b), shaded areas represent the standard deviations of the NMI values over the $n_{rep}$ simulated graphs.
  • Figure 3: Comparison of clustering performance on directed graphs using $k$-means in terms of mean NMI (line) and the corresponding standard deviations (shaded areas) over $n_{rep}=50$ simulated graphs when $K=2$. In the simulation, the network size ($N=1,000$) and the block probability matrix $B$ are fixed, while the core group ratio ($\pi_1$) varies from $0.1$ to $0.9$.
  • Figure 4: Comparison of clustering performance on undirected graphs using $k$-means in terms of mean accuracy (NMI) over $n_{rep}=50$ simulated graphs when $K=2$ with $\pi = (0.5, 0.5)^\top$: (a) NMI with fixed network size ($N=1,000$) and varying network density; (b) NMI with fixed expected network density ($\alpha = 0.05$) and varying network sizes. In (a) and (b), shaded areas represent the standard deviations of the NMI values over the $n_{rep}$ simulated graphs.
  • Figure 5: Comparison of clustering performance on undirected graphs using $k$-means in terms of mean NMI (line) and the corresponding standard deviations (shaded areas) over $n_{rep}=50$ simulated graphs when $K=2$. In the simulation, the network size ($N=1,000$) and the block probability matrix $B$ are fixed, while the core group ratio ($\pi_1$) varies from $0.1$ to $0.9$.
  • ...and 9 more figures
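
The figure captions above report clustering quality as normalized mutual information (NMI). As a rough illustration of how such a score can be computed, the sketch below implements NMI with arithmetic-mean normalization of the two label entropies. The normalization is an assumption, since the captions do not say which variant the paper uses, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same nodes.

    Uses 2 * I(A; B) / (H(A) + H(B)); other normalizations exist.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    # Entropy of each labeling.
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put every node in a single cluster
    return 2 * mi / (h_a + h_b)

same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI is invariant to permuting cluster labels, it avoids the label-matching step that a plain accuracy score would need.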

Theorems & Definitions (27)

  • Definition 1: holland1983stochastic
  • Definition 2: young2007random
  • Definition 3: Core-periphery graphs
  • Definition 4: Doubled adjacency matrix, $\tilde{A}$
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Lemma 1: gallagher2024spectral.
  • proof
  • ...and 17 more