Graphlets correct for the topological information missed by random walks
Sam F. L. Windels, Noel Malod-Dognin, Natasa Przulj
TL;DR
The paper addresses a gap in random-walk-based network analysis: random walks are efficient, but they miss substantial local topological information that matters for downstream tasks. It introduces orbit adjacency, a formalization that counts how often a pair of nodes co-occurs on a given pair of graphlet orbits, and proves that random walks visiting up to $k$ nodes capture only a subset of the orbit adjacencies for up to $k$-node graphlets. To enable practical analysis, it presents GRADCO, which exhaustively computes all 28 orbit adjacency matrices for up to four-node graphlets, and defines PMI-based embeddings from orbit adjacencies (Graphlet-orbit PMI, GOPMI) and from random walks (RW PMI, RWPMI), so that node representations can capture topology unseen by random walks. Empirical results on six real networks show that orbit-adjacency embeddings often outperform random-walk-based embeddings, including cases where the best-performing orbit adjacencies are unseen by random walks, highlighting the value of richer topological neighborhood information for node labeling and potentially other tasks. The work points to future directions in scalable integration with sampling-based methods and extensions to directed, weighted, temporal, and hypergraph settings.
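To make the PMI step concrete, the following is a minimal sketch of a generic positive pointwise mutual information (PPMI) transform applied to pairwise co-occurrence counts. It is not the paper's exact GOPMI or RWPMI definition, which is not given in this summary; the function name `ppmi` and the dictionary-based input format are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def ppmi(counts):
    """Generic positive PMI over pair counts (illustrative, not the paper's formula).

    counts: dict mapping an ordered node pair (u, v) to a non-negative count,
            e.g. the entries of one orbit adjacency matrix.
    Returns a dict holding only the pairs whose PMI is strictly positive.
    """
    total = sum(counts.values())
    row_sum = defaultdict(float)  # marginal count of u as first element
    col_sum = defaultdict(float)  # marginal count of v as second element
    for (u, v), c in counts.items():
        row_sum[u] += c
        col_sum[v] += c

    result = {}
    for (u, v), c in counts.items():
        if c <= 0:
            continue  # PMI is undefined for zero counts; treat as absent
        # PMI(u, v) = log( P(u, v) / (P(u) * P(v)) )
        value = math.log(c * total / (row_sum[u] * col_sum[v]))
        if value > 0:
            result[(u, v)] = value
    return result
```

An embedding would then typically be obtained by factorizing the resulting sparse matrix, for example with a truncated SVD; that step is omitted here.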
Abstract
Random walks are widely used for mining networks due to the computational efficiency of computing them. For instance, graph representation learning learns a d-dimensional embedding space, so that the nodes that tend to co-occur on random walks (a proxy of being in the same network neighborhood) are close in the embedding space. Specific local network topology (i.e., structure) influences the co-occurrence of nodes on random walks, so random walks of limited length capture only partial topological information, hence diminishing the performance of downstream methods. We explicitly capture all topological neighborhood information and improve performance by introducing orbit adjacencies that quantify the adjacencies of two nodes as co-occurring on a given pair of graphlet orbits, which are symmetric positions on graphlets (small, connected, non-isomorphic, induced subgraphs of a large network). Importantly, we mathematically prove that random walks on up to k nodes capture only a subset of all the possible orbit adjacencies for up to k-node graphlets. Furthermore, we enable orbit adjacency-based analysis of networks by developing an efficient GRaphlet-orbit ADjacency COunter (GRADCO), which exhaustively computes all 28 orbit adjacency matrices for up to four-node graphlets. Note that four-node graphlets suffice, because real networks are usually small-world. In large networks on around 20,000 nodes, GRADCO computes the 28 matrices in minutes. On six real networks from various domains, we compare the performance of node-label predictors obtained by using the network embeddings based on our orbit adjacencies to those based on random walks. We find that orbit adjacencies, which include those unseen by random walks, outperform random walk-based adjacencies, demonstrating the importance of the inclusion of the topological neighborhood information that is unseen by random walks.
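As an illustration of what an orbit adjacency counts, the sketch below enumerates every induced three-node subgraph of a small undirected graph and records, for each ordered node pair, how often the pair co-occurs on a given pair of positions. It is a brute-force illustration under stated assumptions, not GRADCO: the function name `three_node_orbit_adjacencies`, the descriptive orbit labels (`end_end`, `centre_end`, `triangle`), and the edge-list input are hypothetical, and the paper's own orbit numbering and its four-node cases are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def three_node_orbit_adjacencies(edges):
    """Brute-force orbit adjacency counts for three-node graphlets.

    edges: iterable of undirected edges (u, v) over sortable node labels.
    Returns a dict of Counters keyed by ordered node pairs:
      "triangle":   both nodes lie on the same induced triangle
      "end_end":    both nodes are end points of an induced 3-node path
      "centre_end": first node is the centre, second an end, of such a path
    """
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    counts = {"end_end": Counter(), "centre_end": Counter(), "triangle": Counter()}
    for triple in combinations(sorted(adj), 3):
        # Edges present among the three nodes decide the induced graphlet.
        present = [(x, y) for x, y in combinations(triple, 2) if y in adj[x]]
        if len(present) == 3:  # induced triangle
            for x, y in present:
                counts["triangle"][(x, y)] += 1
                counts["triangle"][(y, x)] += 1
        elif len(present) == 2:  # induced three-node path
            (p, q), (r, s) = present
            centre = ({p, q} & {r, s}).pop()  # the node shared by both edges
            end_a, end_b = (n for n in triple if n != centre)
            counts["end_end"][(end_a, end_b)] += 1
            counts["end_end"][(end_b, end_a)] += 1
            for end in (end_a, end_b):
                counts["centre_end"][(centre, end)] += 1
        # Triples with fewer than two edges are disconnected, so not graphlets.
    return counts
```

The end-to-end counts relate nodes that are not directly connected, while the centre-to-end counts are asymmetric. Enumerating all triples costs cubic time in the number of nodes, so this is only suitable for toy graphs.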
