Estimating Random-Walk Probabilities in Directed Graphs
Christian Bertram, Mads Vestergaard Jensen, Mikkel Thorup, Hanzhi Wang, Shuyi Yan
TL;DR
This work studies α-discounted random walks on directed graphs and the problem of estimating π(s,t), the Personalized PageRank (PPR) score, in the single-pair, single-source, single-target, and single-node variants under different combinations of graph-access queries. It gives a complete picture by proving tight lower and upper bounds for every problem variant and query combination, closing the polynomial gaps that remained in both the worst-case and average-case settings. For all but one query combination, the new lower bounds show that existing algorithms are already optimal. For the remaining combination, a new randomized bidirectional framework that combines randomized backward propagation with selective Monte Carlo estimation matches the lower bound and is optimal up to polylogarithmic factors. The results clarify how the available query types (IN-SORTED, ADJ, JUMP) affect the complexity of PPR estimation, which can inform the design of graph APIs for large-scale graph analytics.
Abstract
We study discounted random walks in directed graphs. In each step, the walk either terminates with a constant probability $α$, or proceeds to a random out-neighbor. Our goal is to estimate the probability $π(s, t)$ that a discounted random walk starting from $s$ terminates at $t$. This probability is also known as the Personalized PageRank (PPR) score, which measures the relevance of $t$ to $s$, for instance, when $s$ and $t$ are web pages on the Internet. We aim to estimate $π(s, t)$ within a constant relative error with constant probability. A variety of algorithms have been developed for several problem variants, such as single-pair, single-source, single-target, and single-node estimation, under both worst-case and average-case settings, and for different combinations of allowed graph queries. However, in many important cases, there remain polynomial gaps between known upper and lower bounds. In this paper, we establish tight bounds for all problem variants and query combinations, closing all existing gaps in both the worst-case and average-case settings. We provide tight (up to logarithmic factors) lower bounds, showing that for all but one query combination, existing algorithms are already optimal. For the remaining case, we design a novel algorithm that matches our new lower bound, thereby achieving optimality. This is the first algorithm to exploit this specific query combination. It uses a new randomized bidirectional framework that combines randomized backward propagation with selective Monte Carlo estimation.
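To make the definition of $π(s, t)$ concrete, the following is a minimal Python sketch of the plain Monte Carlo baseline: simulate many discounted walks from $s$ and report the fraction that terminate at $t$. It is not the paper's bidirectional algorithm. The adjacency-list representation, the function names, and the default parameters are illustrative assumptions, as is the convention that a walk reaching a node with no out-neighbors terminates there.

```python
import random


def discounted_walk_endpoint(graph, s, alpha, rng):
    """Run one alpha-discounted walk from s and return the node where it terminates."""
    node = s
    while True:
        # In each step, the walk terminates at the current node with probability alpha.
        if rng.random() < alpha:
            return node
        out_neighbors = graph[node]
        if not out_neighbors:
            # Assumption (not from the paper): a walk at a node with no
            # out-neighbors terminates there.
            return node
        # Otherwise the walk proceeds to a uniformly random out-neighbor.
        node = rng.choice(out_neighbors)


def estimate_ppr(graph, s, t, alpha=0.2, num_walks=100_000, seed=0):
    """Estimate pi(s, t) as the fraction of walks from s that terminate at t."""
    rng = random.Random(seed)
    hits = sum(
        discounted_walk_endpoint(graph, s, alpha, rng) == t
        for _ in range(num_walks)
    )
    return hits / num_walks


if __name__ == "__main__":
    # Hypothetical toy graph: a directed 3-cycle 0 -> 1 -> 2 -> 0.
    cycle = {0: [1], 1: [2], 2: [0]}
    print(estimate_ppr(cycle, s=0, t=0, alpha=0.2))
```

On the directed 3-cycle, a walk from node 0 can only terminate at node 0 after a multiple of three steps, so $π(0, 0) = α / (1 - (1 - α)^3)$, which is about 0.41 for $α = 0.2$. The estimate should land close to that value. This baseline needs on the order of $1/π(s, t)$ walks to achieve constant relative error, which is why small scores are expensive to estimate with forward walks alone.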
