Graphlets correct for the topological information missed by random walks

Sam F. L. Windels, Noel Malod-Dognin, Natasa Przulj

TL;DR

The paper addresses the gap that random walks, while efficient, miss substantial local topology information crucial for downstream tasks. It introduces orbit adjacency, a formalization that counts joint co-occurrences of node pairs on pairs of graphlet orbits, and proves that random walks of length up to $l$ capture only a subset of the orbit adjacencies for up to $k$-node graphlets. To enable practical analysis, it presents GRADCO, which exhaustively computes 28 orbit adjacency matrices for up to four-node graphlets, and defines orbit-adjacency based embeddings via Graphlet-orbit PMI (GOPMI) and RW PMI (RWPMI), enabling principled node representations that capture unseen topology. Empirical results on six real networks show that orbit-adjacency embeddings often outperform random-walk based embeddings, including cases where the best orbit adjacencies are unseen by random walks, highlighting the value of richer topological neighborhood information for node labeling and potentially other tasks. The work points to future directions in scalable integration with sampling-based methods and extensions to directed, weighted, temporal, and hypergraph settings.

Abstract

Random walks are widely used for mining networks due to the computational efficiency of computing them. For instance, graph representation learning learns a d-dimensional embedding space, so that the nodes that tend to co-occur on random walks (a proxy of being in the same network neighborhood) are close in the embedding space. Specific local network topology (i.e., structure) influences the co-occurrence of nodes on random walks, so random walks of limited length capture only partial topological information, hence diminishing the performance of downstream methods. We explicitly capture all topological neighborhood information and improve performance by introducing orbit adjacencies that quantify the adjacencies of two nodes as co-occurring on a given pair of graphlet orbits, which are symmetric positions on graphlets (small, connected, non-isomorphic, induced subgraphs of a large network). Importantly, we mathematically prove that random walks on up to k nodes capture only a subset of all the possible orbit adjacencies for up to k-node graphlets. Furthermore, we enable orbit adjacency-based analysis of networks by developing an efficient GRaphlet-orbit ADjacency COunter (GRADCO), which exhaustively computes all 28 orbit adjacency matrices for up to four-node graphlets. Note that four-node graphlets suffice, because real networks are usually small-world. In large networks on around 20,000 nodes, GRADCO computes the 28 matrices in minutes. On six real networks from various domains, we compare the performance of node-label predictors obtained by using the network embeddings based on our orbit adjacencies to those based on random walks. We find that orbit adjacencies, which include those unseen by random walks, outperform random walk-based adjacencies, demonstrating the importance of the inclusion of the topological neighborhood information that is unseen by random walks.

Paper Structure (24 sections, 3 theorems, 13 equations, 16 figures, 1 table, 4 algorithms)

Key Result

Lemma 1

Let $A$ be the binary adjacency matrix of an unweighted, undirected graph $H$. Then for the $l^{th}$ power of the adjacency matrix, $A^l$, the entry $A^l(i,j)$ is the number of possible unique walks of length $l$ from node $i$ to node $j$ in $H$.
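
Lemma 1 can be checked numerically on a small graph. The sketch below is illustrative only: the four-node graph and the helper `count_walks_brute_force` are assumptions made for this example and do not come from the paper. It compares the entries of $A^l$ with walk counts obtained by explicit enumeration.

```python
import numpy as np

# Hypothetical 4-node undirected graph (not from the paper):
# a triangle on nodes 0, 1, 2 plus a pendant edge 2-3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])


def count_walks_brute_force(adj, length):
    """Count walks of exactly `length` edges between all node pairs by enumeration."""
    n = adj.shape[0]
    counts = np.zeros((n, n), dtype=int)

    def extend(start, current, remaining):
        # A walk may revisit nodes and edges, so no visited-set is kept.
        if remaining == 0:
            counts[start, current] += 1
            return
        for nxt in np.flatnonzero(adj[current]):
            extend(start, int(nxt), remaining - 1)

    for start in range(n):
        extend(start, start, length)
    return counts


l = 3
walks_by_power = np.linalg.matrix_power(A, l)        # entries A^l(i, j)
walks_by_enumeration = count_walks_brute_force(A, l)  # explicit walk counts

print(np.array_equal(walks_by_power, walks_by_enumeration))
```

The enumeration is exponential in $l$ and only serves as a check; the matrix power is the practical way to obtain the counts.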

Figures (16)

  • Figure 1: An illustration of graphlets, orbits, graphlet adjacency and orbit adjacency. A): Example network $H$. B): All the graphlets with up to four nodes, labelled from $G_0$ to $G_8$. The automorphism orbits are indicated by the same shade and labelled from 0 to 14. C): The frequency at which node $a$ occurs on each orbit in the example network $H$ (panel A) and how those counts sum into graphlet counts. For instance, node $a$ occurs on orbit $o_{1}$ twice: once in the path $a$-$b$-$c$ and once in the path $a$-$b$-$e$. It never occurs on orbit $o_{2}$, i.e., the centre of a three-node path. Hence, node $a$ occurs on graphlet $G_1$ twice. D): The graphlet adjacency matrices $A_{G_{0}}$ and $A_{G_{1}}$ for the example network $H$. The off-diagonal elements of $A_{G_{1}}$ correspond to the frequency at which the nodes in the corresponding rows and columns co-occur on graphlet $G_1$ in $H$. For instance, $A_{G_{1}}(a,b)=2$, as $a$ and $b$ co-occur twice on $G_1$: via paths $a$-$b$-$c$ and $a$-$b$-$e$. E): The orbit adjacency matrices $A_{o_{1 \hbox{-} 2}}$ and $A_{o_{1 \hbox{-,-} 1}}$ for the example network $H$. The off-diagonal elements of $A_{o_{1 \hbox{-} 2}}$ correspond to the frequency at which the nodes in the corresponding rows and columns co-occur on graphlet $G_1$ in $H$, with the $row$-node touching orbit 1 and the $column$-node touching orbit 2. For instance, $A_{o_{1 \hbox{-} 2}}(a,b)=2$, as $a$ and $b$ co-occur twice on $G_1$ with node $a$ on orbit 1 and node $b$ on orbit 2: via paths $a$-$b$-$c$ and $a$-$b$-$e$. Analogously, $A_{o_{1 \hbox{-,-} 1}}(a,b)=0$, as $a$ and $b$ never co-occur on $G_1$ with both $a$ and $b$ on orbit 1. Similar to how orbit counts sum into graphlet counts (panel C), graphlet adjacency matrix $A_{G_{1}}$ can be computed as the sum of orbit adjacency matrices: $A_{G_{1}} = A_{o_{1 \hbox{-} 2}} + A_{o_{2 \hbox{-} 1}} + A_{o_{1 \hbox{-,-} 1}}$.
  • Figure 2: The different orbit adjacencies touched by the source and sink nodes in random walks of length 2 and 3. The dashed arrows indicate the edges visited by the random walks.
  • Figure 3: Node-label prediction accuracy for each class label in the Amazon-Computer network. We show for each class label (x-axis) and each model (legend), the node-label prediction accuracy achieved by the best performing underlying adjacency (detailed above each bar), measured using the micro averaged F1 score. In the case of orbit adjacency embeddings, $\circ$ and $\bullet$ on the bar indicate if the corresponding best performing orbit adjacency is seen or unseen by random walks up to length three, respectively.
  • Figure 4: Performance evaluation of the best type of adjacency across six different networks. On the left, for each of the six networks, we show the average rank of the best performing adjacency for each embedding strategy, measured by using the micro averaged F1 score. The error bars represent the 95% confidence interval of the average rank, computed by using bootstrapping. On the right, we show the proportion of times the best performing orbit adjacency is unseen by random walks up to length three, when the orbit adjacency outperforms random walk adjacency and DeepWalk, measured by using the micro averaged F1 score.
  • Figure 5: Node-label prediction accuracy for each class label for our six networks using the micro averaged F1 score. For each of our six networks (top to bottom), we show for each class label (x-axis) and each model (legend), the node-label prediction accuracy achieved by the best performing underlying adjacency (detailed above each bar), measured using the micro averaged F1 score. In the case of orbit adjacency embeddings, a $\circ$ and $\bullet$ on the bar indicate if the corresponding best performing orbit adjacency is seen or unseen by random walks up to length three, respectively.
  • ...and 11 more figures
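
The orbit adjacency definitions in the Figure 1 caption can be sketched for the three-node path graphlet $G_1$. The code below is a brute-force illustration, not GRADCO. The five-node graph is a hypothetical example and is not claimed to be the network $H$ of Figure 1. The variable names are invented here. Orbit 1 is assumed to be the end of the path and orbit 2 its centre, as in panel E of the caption.

```python
from itertools import permutations

import numpy as np

# Hypothetical undirected graph (an assumption for illustration only).
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("b", "e"), ("c", "e"), ("c", "d")]

idx = {v: i for i, v in enumerate(nodes)}
n = len(nodes)
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1

# Orbit adjacency matrices for G_1 (induced three-node path):
A_end_centre = np.zeros((n, n), dtype=int)  # row node on orbit 1, column node on orbit 2
A_end_end = np.zeros((n, n), dtype=int)     # both nodes on orbit 1 (the two path ends)

# Enumerate induced paths u - v - w: v is the centre, and the ends u, w
# must be non-adjacent. Requiring u < w visits each path exactly once.
for u, v, w in permutations(range(n), 3):
    if u < w and A[u, v] and A[v, w] and not A[u, w]:
        A_end_centre[u, v] += 1
        A_end_centre[w, v] += 1
        A_end_end[u, w] += 1
        A_end_end[w, u] += 1

# Swapping the roles of row and column gives the centre-to-end adjacency.
A_centre_end = A_end_centre.T

# Graphlet adjacency for G_1 as the sum of its orbit adjacencies.
A_G1 = A_end_centre + A_centre_end + A_end_end

a, b = idx["a"], idx["b"]
print(A_end_centre[a, b], A_end_end[a, b], A_G1[a, b])
```

In this hypothetical graph, nodes `a` and `b` co-occur on $G_1$ through the induced paths `a`-`b`-`c` and `a`-`b`-`e`, with `a` at an end and `b` at the centre in both. The whole count therefore falls into the end-to-centre matrix and none into the end-to-end matrix.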

Theorems & Definitions (4)

  • Lemma 1
  • Theorem 1
  • proof
  • Theorem 2