Fast Approximate CoSimRanks via Random Projections

Renchi Yang, Xiaokui Xiao

TL;DR

This work tackles the expensive problem of all-pairs CoSimRank computation by introducing RPCS, a randomized algorithm that projects the $n\times n$ random-walk matrix into an $n\times d$ matrix using Johnson-Lindenstrauss-type random projections. By iteratively accumulating low-rank factors via the updates $\mathbf{H}^{(k)}=\sqrt{c}\,\mathbf{P}\mathbf{H}^{(k-1)}$ and $\widehat{\mathbf{S}}=\mathbf{I}+\sum_k \mathbf{H}^{(k)}(\mathbf{H}^{(k)})^{\top}$, RPCS achieves an $\epsilon$-accurate approximation for all entries with high probability, while reducing the per-iteration cost to $O(n^2d)$ and the overall time to $O\left(\min\left\{\frac{n^2\ln n}{\epsilon^2}\ln\frac{1}{\epsilon},\,n^3\ln\frac{1}{\epsilon}\right\}\right)$. The method is backed by a theoretical error bound via inner-product preservation and a procedure that selects the projection dimension $d$ and the parameter $\delta$ to optimize runtime. Empirical results on six real graphs demonstrate substantial speedups over state-of-the-art methods, enabling $\epsilon$-approximate all-pairs CoSimRank queries on large datasets such as a million-edge Twitter graph on a single commodity server.
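The two updates above can be tried out on a toy graph. The following is a minimal NumPy sketch, not the authors' implementation: it assumes that $\mathbf{P}$ is a row-normalized adjacency matrix, that $\mathbf{H}^{(0)}$ is an $n\times d$ Gaussian matrix with entries of variance $1/d$, and that the decay factor `c`, the iteration count `iters`, and the dimension `d` take the illustrative values shown. The function names are hypothetical.

```python
import numpy as np

def row_stochastic(adj):
    """Row-normalize an adjacency matrix so that every row sums to 1."""
    adj = np.asarray(adj, dtype=float)
    return adj / adj.sum(axis=1, keepdims=True)

def exact_truncated(P, c, iters):
    """Reference value: I + sum_{k=1..iters} c^k P^k (P^k)^T, using n x n products."""
    n = P.shape[0]
    S = np.eye(n)
    Pk = np.eye(n)
    for k in range(1, iters + 1):
        Pk = P @ Pk                      # Pk now holds P^k
        S += (c ** k) * (Pk @ Pk.T)
    return S

def projected(P, c, iters, d, rng):
    """Sketch of the projected updates H^(k) = sqrt(c) P H^(k-1), S_hat = I + sum_k H^(k) H^(k)^T."""
    n = P.shape[0]
    # Assumption: H^(0) is a Gaussian projection with E[H^(0) H^(0)^T] = I.
    H = rng.standard_normal((n, d)) / np.sqrt(d)
    S_hat = np.eye(n)
    for _ in range(iters):
        H = np.sqrt(c) * (P @ H)         # n x d factor instead of an n x n power
        S_hat += H @ H.T
    return S_hat

# Toy graph (assumed for illustration): random directed edges plus a ring,
# so that every node has at least one out-neighbor.
rng = np.random.default_rng(0)
n = 50
adj = (rng.random((n, n)) < 0.2).astype(float)
np.fill_diagonal(adj, 0.0)
adj[np.arange(n), (np.arange(n) + 1) % n] = 1.0
P = row_stochastic(adj)

S = exact_truncated(P, c=0.6, iters=10)
S_hat = projected(P, c=0.6, iters=10, d=4000, rng=rng)
max_err = np.abs(S - S_hat).max()
print(f"max absolute entry-wise error: {max_err:.4f}")
```

On a graph this small the projection does not save any work, since `d` exceeds `n`; the sketch only illustrates that the accumulated factors track the exact sum entry by entry. The benefit described above appears when $d$ is much smaller than $n$.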

Abstract

Given a graph $G$ with $n$ nodes and two nodes $u,v\in G$, the CoSimRank value $s(u,v)$ quantifies the similarity between $u$ and $v$ based on graph topology. Compared to SimRank, CoSimRank has been shown to be more accurate and effective in many real-world applications, including synonym expansion, lexicon extraction, and entity relatedness in knowledge graphs. Computing all pairwise CoSimRanks in $G$ is highly expensive and challenging. Existing solutions all focus on devising approximate algorithms for this computation. To attain a desired absolute accuracy guarantee $\epsilon$, the state-of-the-art approximate algorithm for computing all pairwise CoSimRanks requires $O(n^3\log_2(\ln(\frac{1}{\epsilon})))$ time, which is prohibitively expensive even when $\epsilon$ is large. In this paper, we propose RPCS, a fast randomized algorithm for computing all pairwise CoSimRank values. The basic idea of RPCS is to approximate the $n\times n$ matrix multiplications in CoSimRank computation via random projection. Theoretically, RPCS runs in $O(\frac{n^2\ln(n)}{\epsilon^2}\ln(\frac{1}{\epsilon}))$ time while ensuring an absolute error of at most $\epsilon$ in each CoSimRank value in $G$ with high probability. Extensive experiments using six real graphs demonstrate that RPCS is orders of magnitude faster than the state of the art. In particular, on a million-edge Twitter graph, RPCS answers the $\epsilon$-approximate ($\epsilon=0.1$) all pairwise CoSimRank query within 4 hours using a single commodity server, while existing solutions fail to terminate within a day.

Paper Structure

This paper contains 13 sections, 4 theorems, 19 equations, 1 figure, 3 tables, 2 algorithms.

Key Result

Lemma

Given any integer $k\ge 1$, $\sum_{v_j\in V}\mathbf{P}^k[i,j]=1$ holds for all $v_i\in V$.
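A short sketch of why this holds, assuming (as the statement suggests, but this listing does not spell out) that $\mathbf{P}$ is row-stochastic, i.e. $\mathbf{P}\mathbf{1}=\mathbf{1}$, where $\mathbf{1}$ denotes the all-ones vector:

$$\mathbf{P}^k\mathbf{1}=\mathbf{P}^{k-1}(\mathbf{P}\mathbf{1})=\mathbf{P}^{k-1}\mathbf{1}=\cdots=\mathbf{P}\mathbf{1}=\mathbf{1}.$$

The $i$-th entry of $\mathbf{P}^k\mathbf{1}$ is exactly $\sum_{v_j\in V}\mathbf{P}^k[i,j]$, so every row of $\mathbf{P}^k$ sums to 1.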

Figures (1)

  • Figure 1: Running time with varying $\epsilon$.

Theorems & Definitions (8)

  • Lemma
  • Definition: CoSimRank [rothe2014cosimrank]
  • Definition: $\epsilon$-approximate all pairwise CoSimRank query [yu2015co]
  • Lemma: Preservation of inner products [kaban2015improved]
  • Lemma
  • Proof
  • Theorem
  • Proof