Robust Node Affinities via Jaccard-Biased Random Walks and Rank Aggregation

Bastian Pfeifer, Michael G. Schimek

TL;DR

TopKGraphs is a non-parametric, interpretable, general-purpose representation of node similarity for network analysis and machine learning workflows. The reported results suggest it is a versatile tool for bridging simple local similarity measures and more complex embedding-based approaches.

Abstract

Estimating node similarity is a fundamental task in network analysis and graph-based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start-node-anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices. TopKGraphs provides a non-parametric, interpretable, and general-purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti-Fortunato-Radicchi benchmark graphs), k-nearest-neighbor graphs from tabular datasets, and a curated high-confidence protein-protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding-based approaches, facilitating both data mining and network analysis applications.
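
The pipeline described in the abstract (start-anchored walks biased by Jaccard similarity, with the resulting partial rankings aggregated into affinities) can be illustrated with a minimal sketch. The details below are assumptions made for illustration, not the authors' implementation: the transition weight is taken to be the Jaccard similarity between the closed neighborhood of the start node and that of the candidate neighbor, a small `eps` keeps zero-similarity neighbors reachable, and the robust rank aggregation step is replaced by a simple averaged reciprocal-rank score. The names `biased_walk` and `affinity_row` are hypothetical.

```python
import random
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def biased_walk(adj, start, length, rng, eps=1e-3):
    """One start-anchored walk; returns visited nodes in first-visit order.

    Assumption: a step from the current node to neighbor v has weight
    Jaccard(N[start], N[v]) + eps, where N[.] is the closed neighborhood.
    """
    anchor = adj[start] | {start}
    current, order = start, []
    for _ in range(length):
        nbrs = sorted(adj[current])
        if not nbrs:
            break
        weights = [jaccard(anchor, adj[v] | {v}) + eps for v in nbrs]
        current = rng.choices(nbrs, weights=weights, k=1)[0]
        if current != start and current not in order:
            order.append(current)
    return order


def affinity_row(adj, start, n_walks=200, length=5, seed=0):
    """Aggregate the partial rankings of many walks into affinities for `start`.

    Stand-in aggregation (an assumption): a node at 0-based position r in a
    walk's ranking scores 1 / (r + 1); scores are averaged over all walks.
    """
    rng = random.Random(seed)
    scores = defaultdict(float)
    for _ in range(n_walks):
        for rank, node in enumerate(biased_walk(adj, start, length, rng)):
            scores[node] += 1.0 / (rank + 1)
    return {node: s / n_walks for node, s in scores.items()}


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

row = affinity_row(adj, start=0)
print(sorted(row.items(), key=lambda kv: -kv[1]))
```

Calling `affinity_row` once per node and stacking the rows would give a node-to-node affinity matrix in the spirit of the method. In this toy graph, nodes in the same triangle as the start node should receive higher scores than nodes in the other triangle.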

Paper Structure

This paper contains 7 sections, 13 equations, 8 figures.

Figures (8)

  • Figure 1: Community detection performance on synthetic stochastic block model (SBM) graphs with varying intra-community density. Boxplots show the distribution of Adjusted Rand Index (ARI) values across 50 simulations. Hierarchical clustering (Ward’s method) was applied to node affinity matrices derived from six approaches: TopKGraphs, Jaccard similarity, Dice similarity, Laplacian embedding, personalized PageRank, and Node2Vec. Synthetic graphs had three equally sized communities (ten nodes each) with fixed low inter-community connection probability (0.05) and intra-community probabilities ranging from 0.10 to 0.50.
  • Figure 2: Community detection performance on synthetic stochastic block model (SBM) graphs with varying inter-community density. Boxplots show the distribution of Adjusted Rand Index (ARI) values across 50 simulations. Hierarchical clustering (Ward’s method) was applied to node affinity matrices derived from six approaches: TopKGraphs, Jaccard similarity, Dice similarity, Laplacian embedding, personalized PageRank, and Node2Vec. Synthetic graphs had three equally sized communities (ten nodes each) with fixed intra-community connection probability of $0.50$ and inter-community probabilities ranging from 0.01 to 0.30.
  • Figure 3: Community detection performance on synthetic LFR benchmark graphs. Boxplots display the distribution of Adjusted Rand Index (ARI) values over 50 simulation runs. In each iteration, an LFR benchmark graph (100 nodes, average degree 5, maximum degree 10) was generated with mixing parameter $\mu \in \{0.03, 0.05, 0.10, 0.20, 0.30\}$, fixed degree exponent $\tau_1 = 2$, and fixed community size exponent $\tau_2 = 1.1$ (community sizes between 5 and 50 nodes). Hierarchical clustering (Ward’s method) was applied to node-affinity matrices derived from six approaches: TopKGraphs, Jaccard similarity, Dice similarity, Laplacian embedding, personalized PageRank, and Node2Vec. Clustering performance was evaluated against the planted LFR community structure.
  • Figure 4: Effect of random walk length and number of walks on community detection performance. Mean and standard deviation of Adjusted Rand Index (ARI) values from 50 simulation runs are shown for successively increasing (a) walk length and (b) number of walks. Hierarchical clustering (Ward’s method) was applied to node-affinity representations obtained from TopKGraphs and Node2Vec. (a) LFR benchmark graphs (100 nodes, average degree 5, maximum degree 10, community sizes between 5 and 50 nodes) with mixing parameter $\mu = 0.05$, degree exponent $\tau_1 = 2$, and community size exponent $\tau_2 = 1.1$. (b) SBM benchmark graphs with three equally sized communities (ten nodes each), fixed intra-community connection probability of $0.50$, and inter-community connection probability of $0.05$.
  • Figure 5: Computational scaling of node-affinity methods on synthetic LFR graphs. Runtime (in seconds) as a function of graph size for six node-affinity approaches: TopKGraphs, Node2Vec, Jaccard similarity, Dice similarity, Laplacian embedding, and personalized PageRank. For each graph size (50–1000 nodes), an LFR benchmark graph was generated (average degree 5, maximum degree 10, mixing parameter $\mu = 0.05$, $\tau_1 = 2$, $\tau_2 = 1.1$), and the computation time required to obtain the corresponding affinity matrix or embedding was recorded. The figure illustrates the computational scaling behavior of the different methods as network size increases.
  • ...and 3 more figures
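
The figure captions above evaluate clusterings on stochastic block model graphs using the Adjusted Rand Index. As a rough illustration of that setup, the sketch below samples an SBM with three communities of ten nodes each and computes the ARI between two labelings using only the Python standard library. The function names `sample_sbm` and `adjusted_rand_index` are hypothetical, and the clustering step itself (Ward's method on an affinity matrix) is deliberately left out; this is not the authors' evaluation code.

```python
import random
from collections import Counter
from math import comb


def sample_sbm(sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM; returns (edge list, community label per node)."""
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels


def _pair_count(counts):
    """Number of unordered pairs that fall inside the groups with these sizes."""
    return sum(comb(c, 2) for c in counts)


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same nodes."""
    n = len(truth)
    same_both = _pair_count(Counter(zip(truth, pred)).values())
    same_truth = _pair_count(Counter(truth).values())
    same_pred = _pair_count(Counter(pred).values())
    expected = same_truth * same_pred / comb(n, 2)  # chance-level agreement
    max_index = 0.5 * (same_truth + same_pred)
    if max_index == expected:  # degenerate case, e.g. both are one cluster
        return 1.0
    return (same_both - expected) / (max_index - expected)


# Setting from the SBM captions: three communities of ten nodes,
# intra-community probability 0.50, inter-community probability 0.05.
edges, labels = sample_sbm([10, 10, 10], p_in=0.50, p_out=0.05)

relabeled = [(c + 1) % 3 for c in labels]  # same partition, different names
merged = [0] * len(labels)                 # every node in a single cluster
print(adjusted_rand_index(labels, relabeled))
print(adjusted_rand_index(labels, merged))
```

The ARI is invariant to renaming clusters, so the relabeled partition scores as a perfect match, while collapsing all nodes into one cluster scores at chance level. In a full experiment, `pred` would be the cluster assignment obtained from an affinity matrix.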