Parametrized Power-Iteration Clustering for Directed Graphs

Gwendal Debaussart-Joniec, Harry Sevi, Matthieu Jonckheere, Argyris Kalogeratos

TL;DR

The paper tackles clustering in directed graphs where edge directionality breaks standard diffusion assumptions. It introduces Parametrized Power-Iteration Clustering (ParPIC), an eigen-decomposition-free method that uses a parametrized reversible random-walk operator ${\mathbf{P}}_{(\nu)}$ and power iterations to obtain a low-dimensional diffusion embedding, with diffusion time $t$ selected by an entropy-based criterion ${\mathcal{H}}(t)$. By designing the vertex measure $\nu$ (notably $\nu_{\gamma}=\gamma d_{in}+(1-\gamma)d_{out}$) and avoiding eigen-decomposition, ParPIC achieves competitive clustering accuracy while offering improved scalability, particularly on weakly connected digraphs and graphs with degree heterogeneity. Experimental results across synthetic and real-world digraphs demonstrate ParPIC’s robustness to directionality, outperforming symmetrization- and teleportation-based methods, and matching or exceeding existing power-iteration approaches. The work provides a practical, principled framework for directed graph clustering with automatic diffusion-scale selection and scalable embedding-based clustering.
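
As a small illustration of the vertex measure $\nu_{\gamma}$ mentioned above, the following sketch computes it from a dense adjacency matrix. The function name `vertex_measure` and the convention that `A[i, j]` is the weight of the edge $i \to j$ are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def vertex_measure(A, gamma):
    """Return nu_gamma = gamma * d_in + (1 - gamma) * d_out.

    Assumption: A[i, j] is the weight of the directed edge i -> j,
    so out-degrees are row sums and in-degrees are column sums.
    """
    A = np.asarray(A, dtype=float)
    d_out = A.sum(axis=1)  # total weight leaving each vertex
    d_in = A.sum(axis=0)   # total weight entering each vertex
    return gamma * d_in + (1.0 - gamma) * d_out

# Toy digraph with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
nu = vertex_measure(A, gamma=0.5)
```

Under this convention, $\gamma = 1$ recovers the in-degree measure and $\gamma = 0$ the out-degree measure, with intermediate values interpolating between the two.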

Abstract

Vertex-level clustering for directed graphs (digraphs) remains challenging as edge directionality breaks the key assumptions underlying popular spectral methods, which also incur the overhead of eigen-decomposition. This paper proposes Parametrized Power-Iteration Clustering (ParPIC), a random-walk-based clustering method for weakly connected digraphs. It builds on the Power-Iteration Clustering paradigm, which uses the rows of the iterated diffusion operator as a data embedding. ParPIC has three important features: the use of parametrized reversible random walk operators, the automatic tuning of the diffusion time, and the efficient truncation of the final embedding, which produces low-dimensional data representations and reduces complexity. Empirical results on synthetic and real-world graphs demonstrate that ParPIC achieves competitive clustering accuracy with improved scalability relative to spectral and teleportation-based methods.

Paper Structure

This paper contains 28 sections, 2 theorems, 30 equations, 11 figures, 5 tables, and 1 algorithm.

Key Result

Proposition 3.2

If $\nu(i) > 0$ for every vertex $i$, then the following statements hold: Under these conditions, ${\mathbf{P}}_{(\nu)}$ is ergodic, with $\pi_{\nu}$ being its unique stationary distribution, i.e. $\pi_{\nu} {\mathbf{P}}_{(\nu)} = \pi_{\nu}$ [seabrook2023tutorial].
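
To make the stationarity condition $\pi_{\nu} {\mathbf{P}}_{(\nu)} = \pi_{\nu}$ concrete, here is a minimal numerical check. The paper's Definition 3.1 is not reproduced in this summary, so the construction below is an assumption: it symmetrizes the $\nu$-weighted flow of the natural random walk and row-normalizes the result, which is one standard way to obtain a reversible chain. The name `reversible_operator` is hypothetical, and the example assumes every vertex has at least one outgoing edge.

```python
import numpy as np

def reversible_operator(A, nu):
    """Build a reversible random-walk operator from a digraph and a measure nu.

    ASSUMPTION: this symmetrizes the nu-weighted flow, W = N P + P^T N,
    and row-normalizes it. It is a stand-in, not the paper's Definition 3.1.
    """
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)  # natural random walk (assumes no sink vertices)
    N = np.diag(nu)
    W = N @ P + P.T @ N                   # symmetric by construction
    d = W.sum(axis=1)                     # equals nu + P^T nu, positive when nu > 0
    return W / d[:, None], d / d.sum()    # operator and its stationary distribution

# Toy digraph with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
nu = 0.5 * A.sum(axis=0) + 0.5 * A.sum(axis=1)  # nu_gamma with gamma = 0.5
P_nu, pi = reversible_operator(A, nu)

rows_sum_to_one = np.allclose(P_nu.sum(axis=1), 1.0)
is_stationary = np.allclose(pi @ P_nu, pi)        # pi P = pi
flow = pi[:, None] * P_nu                         # flow[i, j] = pi_i * P_ij
is_reversible = np.allclose(flow, flow.T)         # detailed balance
```

For this stand-in operator the stationary distribution is proportional to the row sums of the symmetrized flow, so positivity of $\nu$ is what keeps the normalization well defined.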

Figures (11)

  • Figure 1: Overview of the ParPIC pipeline. Given a digraph ${\mathcal{G}}$ with natural random walk ${\mathbf{P}}$, a parametrized reversible random walk operator ${\mathbf{P}}_{(\nu)}$ is constructed based on a vertex measure $\nu$ (Section \ref{sec:our_method}). Power iterations of ${\mathbf{P}}_{(\nu)}$ are then performed to compute ${\mathbf{P}}_{(\nu)}^t$ at a selected diffusion time $t$ (or an approximation to ${\mathbf{P}}_{(\nu)}^t$ is computed, Section \ref{sec:time_selection}). The final data partition is produced by clustering the rows of ${\mathbf{P}}_{(\nu)}^t$, e.g. using $k$-means. This process avoids explicit eigen-decomposition while preserving directional diffusion dynamics for effective clustering.
  • Figure 2: Sensitivity of different methods to cluster-level out-degree heterogeneity. (a) Average performance over $50$ runs while varying the value of $\rho$ in the DiSBM model of Eq. \ref{eq:disbm_rho}. (b) PIC, S-PIC and ParPIC operators at different $\rho$ values.
  • Figure 3: Scaling of runtime with graph size ($3$-cluster C-P DiSBM). ParPIC with the default approximation of the iterated P-RW operator (${\mathbf{P}}_{(\nu)}^t$) is compared to ParPIC with the full computation of the P-RW operator (any typical PIC variant shares this complexity) and to classical Spectral Clustering. The proposed method demonstrates significantly better scalability.
  • Figure 4: Experiments on the Chain DiSBM. (a) Natural (${\mathbf{P}}$), parametrized (${\mathbf{P}}_{(\nu)}$) and symmetrized (${\mathbf{P}}_\textrm{sym}$) random walk operators on the DiSBM Chain model (Eq. \ref{eq:disbm_chain_rho}) for different flow strengths. (b) Clustering sensitivity to flow strength: the proposed ParPIC remains stable across varying $\rho$ values, while variants of PIC degrade significantly as the flow increases.
  • Figure 5: Additional experiments on the Core-Periphery DiSBM. (a)-(b) Clustering performance (AMI) of ParPIC and S-PIC on the Core-Periphery DiSBM as a function of the size of the first block and its out-degree flow (Eq. \ref{eq:disbm_rho_m1}). (c) Clustering performance (AMI) of different methods when varying the number of clusters in a core-periphery structure (Eq. \ref{eq:disbm_cp_nclust}).
  • ...and 6 more figures
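
The last two steps of the pipeline in Figure 1 (iterating the operator, then clustering the rows of the iterate) can be sketched as follows. This is a simplified illustration under stated assumptions: the diffusion time `t` is fixed by hand rather than selected by the entropy criterion, the full matrix power is computed without the paper's approximation or truncation, the toy operator is a generic two-block stochastic matrix standing in for ${\mathbf{P}}_{(\nu)}$, and `iterate_operator` and `kmeans` are hypothetical helper names with a bare-bones Lloyd's algorithm.

```python
import numpy as np

def iterate_operator(P, t):
    """Compute P^t by t successive matrix products (no eigen-decomposition)."""
    Pt = np.eye(P.shape[0])
    for _ in range(t):
        Pt = Pt @ P
    return Pt

def kmeans(X, k, n_iter=50):
    """Minimal Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        # squared distance of every row to its closest current center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Stand-in operator: two blocks of 3 vertices, strong links inside, weak links across.
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)  # row-stochastic operator

embedding = iterate_operator(P, t=3)  # each row is one vertex's embedding
labels = kmeans(embedding, k=2)
```

Each row of the iterated operator is the distribution of a walker started at that vertex after `t` steps, so vertices in the same block end up with similar rows and are grouped together by the clustering step.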

Theorems & Definitions (5)

  • Definition 3.1: Parametrized random walk (P-RW) operator
  • Proposition 3.2: Impact of $\nu$ on the P-RW operator
  • Definition 3.3: Parametrized diffusion distance
  • Definition 3.4: Parametrized diffusion map
  • Proposition 3.5: Behavior of ${\mathcal{H}}(t)$