Fast Approximate CoSimRanks via Random Projections
Renchi Yang, Xiaokui Xiao
TL;DR
This work tackles the expensive problem of all-pairs CoSimRank computation by introducing RPCS, a randomized algorithm that uses Johnson-Lindenstrauss-type random projections to compress the $n\times n$ matrices of the random-walk iteration into $n\times d$ factors. By iteratively accumulating low-rank factors via the updates $\mathbf{H}^{(k)}=\sqrt{c}\,\mathbf{P}\mathbf{H}^{(k-1)}$ and $\widehat{\mathbf{S}}=\mathbf{I}+\sum_k \mathbf{H}^{(k)}(\mathbf{H}^{(k)})^{\top}$, RPCS achieves an $ε$-accurate approximation of every entry with high probability, while reducing the per-iteration cost to $O(n^2d)$ and the overall time to $O\left(\min\left\{\frac{n^2\ln n}{ε^2}\ln\frac{1}{ε},\,n^3\ln\frac{1}{ε}\right\}\right)$. The method is backed by a theoretical error bound based on inner-product preservation, together with a procedure for selecting the projection dimension $d$ and the parameter $δ$ to optimize runtime. Empirical results on six real graphs demonstrate substantial speedups over state-of-the-art methods, enabling $ε$-approximate all-pairs CoSimRank queries on large datasets such as a million-edge Twitter graph on a single commodity server.
Abstract
Given a graph $G$ with $n$ nodes and two nodes $u,v\in G$, the {\em CoSimRank} value $s(u,v)$ quantifies the similarity between $u$ and $v$ based on graph topology. Compared to SimRank, CoSimRank has been shown to be more accurate and effective in many real-world applications, including synonym expansion, lexicon extraction, and entity relatedness in knowledge graphs. Computing all pairwise CoSimRank values in $G$ is highly expensive and challenging, and existing solutions all resort to approximate algorithms. To attain a desired absolute accuracy guarantee $ε$, the state-of-the-art approximate algorithm for computing all pairwise CoSimRank values requires $O(n^3\log_2(\ln(\frac{1}{ε})))$ time, which is prohibitively expensive even when $ε$ is large. In this paper, we propose RPCS, a fast randomized algorithm for computing all pairwise CoSimRank values. The basic idea of RPCS is to approximate the $n\times n$ matrix multiplications in CoSimRank computation via random projection. Theoretically, RPCS runs in $O(\frac{n^2\ln(n)}{ε^2}\ln(\frac{1}{ε}))$ time while ensuring an absolute error of at most $ε$ in each CoSimRank value in $G$ with high probability. Extensive experiments using six real graphs demonstrate that RPCS is orders of magnitude faster than the state of the art. In particular, on a million-edge Twitter graph, RPCS answers the $ε$-approximate ($ε=0.1$) all-pairs CoSimRank query within 4 hours using a single commodity server, while existing solutions fail to terminate within a day.
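The random-projection idea can be illustrated with a minimal NumPy sketch of the updates $\mathbf{H}^{(k)}=\sqrt{c}\,\mathbf{P}\mathbf{H}^{(k-1)}$ and $\widehat{\mathbf{S}}=\mathbf{I}+\sum_k \mathbf{H}^{(k)}(\mathbf{H}^{(k)})^{\top}$ stated in the TL;DR. It is not the authors' implementation. Several choices are assumptions made only for illustration: $\mathbf{P}$ is taken to be a row-stochastic transition matrix, $\mathbf{H}^{(0)}$ is a Gaussian matrix scaled by $1/\sqrt{d}$ (one common Johnson-Lindenstrauss-type projection), the iteration count and $d$ are fixed by hand rather than derived from $ε$ and $δ$, and the function names `exact_cosimrank` and `projected_cosimrank` are hypothetical.

```python
import numpy as np

def exact_cosimrank(P, c, iters):
    """Baseline: S = I + sum_k c^k P^k (P^k)^T using full n x n products."""
    n = P.shape[0]
    S = np.eye(n)
    M = np.eye(n)
    for _ in range(iters):
        M = np.sqrt(c) * (P @ M)      # M = (sqrt(c) P)^k, an n x n matrix
        S += M @ M.T
    return S

def projected_cosimrank(P, c, iters, d, rng):
    """Sketch of the projected iteration: keep an n x d factor instead of n x n."""
    n = P.shape[0]
    # Assumption: Gaussian projection with E[H0 @ H0.T] = I.
    H = rng.standard_normal((n, d)) / np.sqrt(d)
    S_hat = np.eye(n)
    for _ in range(iters):
        H = np.sqrt(c) * (P @ H)      # H^(k) = sqrt(c) P H^(k-1)
        S_hat += H @ H.T              # O(n^2 d) per iteration when formed densely
    return S_hat

# Toy directed graph (hypothetical data): random edges plus a cycle so that
# every node has at least one out-neighbour.
rng = np.random.default_rng(0)
n = 40
A = (rng.random((n, n)) < 0.3).astype(float)
np.fill_diagonal(A, 0.0)
A[np.arange(n), (np.arange(n) + 1) % n] = 1.0
P = A / A.sum(axis=1, keepdims=True)  # assumed row-stochastic transition matrix

S_exact = exact_cosimrank(P, c=0.6, iters=10)
S_hat = projected_cosimrank(P, c=0.6, iters=10, d=2000, rng=rng)
max_err = np.abs(S_exact - S_hat).max()
print("max absolute error:", max_err)
```

Both functions truncate the series after the same number of iterations, so the printed gap reflects only the projection error. Increasing `d` tightens that gap at the price of a larger $O(n^2d)$ per-iteration cost.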
