Accurate and Fast Approximate Graph Pattern Mining at Scale

Anna Arpaci-Dusseau, Zixiang Zhou, Xuhao Chen

TL;DR

ScaleGPM tackles two key bottlenecks in approximate graph pattern mining (A-GPM): unstable termination without a formal confidence guarantee, and poor performance on "needle-in-the-hay" cases. It introduces online convergence detection with a formal $1-\delta$ confidence bound, plus eager-verify pruning and a hybrid framework that chooses between neighbor sampling (NS) and graph sparsification (GS) using cost models and fast profiling. Empirical results show a geometric-mean speedup of $565\times$ over Arya (up to $>6\times10^5\times$) and the ability to handle billion-scale graphs in seconds, outperforming Arya and GraphZero. Together, these advances enable reliable, scalable A-GPM across diverse motifs and large graphs.

Abstract

Approximate graph pattern mining (A-GPM) is an important data analysis tool for many graph-based applications. Existing sampling-based A-GPM systems provide automation and generalization over a wide variety of use cases. However, two major obstacles prevent these systems from being adopted in practice. First, the termination mechanism that decides when to end sampling lacks theoretical backing for its confidence, and is unstable and slow in practice. Second, these systems perform poorly on "needle-in-the-hay" cases, because the extremely low hit rate of their fixed sampling schemes means a huge number of samples is required to converge. We build ScaleGPM, an accurate and fast A-GPM system that removes both obstacles. First, we propose a novel on-the-fly convergence detection mechanism that achieves stable termination and provides a theoretical guarantee on the confidence, with negligible overhead. Second, we propose two techniques to deal with the "needle-in-the-hay" problem: eager-verify and hybrid sampling. Our eager-verify method improves the sampling hit rate by pruning unpromising candidates as early as possible. Hybrid sampling improves performance by automatically choosing the better of the fine-grained and coarse-grained sampling schemes. Experiments show that our online convergence detection mechanism detects convergence and yields stable and rapid termination with theoretically guaranteed confidence. We show the effectiveness of eager-verify in improving the hit rate, and of the scheme-selection mechanism in correctly choosing the better scheme for various cases. Overall, ScaleGPM achieves a geometric-mean speedup of 565x (up to 610169x) over the state-of-the-art A-GPM system, Arya. In particular, ScaleGPM handles billion-scale graphs in seconds, where existing systems either run out of memory or fail to complete in hours.
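To make the idea of on-the-fly convergence detection concrete, the following is a minimal Python sketch of a stopping rule that is checked as samples arrive. It is not ScaleGPM's mechanism: the error estimate used here, $z \cdot \mathrm{stderr} / \mathrm{mean}$ from a normal approximation, is an assumption standing in for the paper's error equation, and the names `OnlineConvergenceDetector`, `estimate`, `make_sampler`, and `min_samples` are hypothetical.

```python
import math
import random


class OnlineConvergenceDetector:
    """Running mean/variance (Welford's method) with a normal-approximation stopping rule.

    Assumption: relative error is estimated as z * stderr / mean. This is an
    illustrative stand-in, not the error equation from the paper.
    """

    def __init__(self, epsilon, z=1.96, min_samples=1000):
        self.epsilon = epsilon          # target relative error
        self.z = z                      # normal quantile for 1 - delta (1.96 is about 95%)
        self.min_samples = min_samples  # guard against stopping on too few samples
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                   # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one sampled count into the running statistics in O(1)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimated_error(self):
        """Estimated relative error of the current mean (infinite until usable)."""
        if self.n < 2 or self.mean <= 0.0:
            return math.inf
        variance = self.m2 / (self.n - 1)
        stderr = math.sqrt(variance / self.n)
        return self.z * stderr / self.mean

    def converged(self):
        return self.n >= self.min_samples and self.estimated_error() <= self.epsilon


def estimate(sample_fn, epsilon, max_samples=10_000_000):
    """Draw samples until the detector reports convergence or the budget runs out."""
    detector = OnlineConvergenceDetector(epsilon)
    while detector.n < max_samples:
        detector.add(sample_fn())
        if detector.converged():
            break
    return detector.mean, detector.n


def make_sampler(hit_rate, weight, rng):
    """Hypothetical hit-or-miss estimator: `weight` on a hit, 0 on a miss.

    Its expected value is hit_rate * weight, which plays the role of the true count.
    """
    return lambda: weight if rng.random() < hit_rate else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    for hit_rate in (0.1, 0.01):
        mean, n = estimate(make_sampler(hit_rate, 1000.0, rng), epsilon=0.1)
        print(f"hit rate {hit_rate}: estimate {mean:.2f} after {n} samples")
```

For a hit-or-miss estimator with hit rate $p$, this assumed rule stops after roughly $z^2(1-p)/(p\,\epsilon^2)$ samples, so the sample count grows inversely with the hit rate. With $\epsilon = 0.1$ and $z = 1.96$, a hit rate of $5\times 10^{-8}$ would need on the order of $7.7\times 10^{9}$ samples, which illustrates why "needle-in-the-hay" cases are expensive for a fixed sampling scheme.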

Paper Structure

This paper contains 28 sections, 2 theorems, 1 equation, 15 figures, 5 tables, 4 algorithms.

Key Result

Theorem 1

Given $\delta$, $n$ samples $X_1,\dots,X_n$ drawn using the NS sampling scheme, and the mean of sampled counts $\mu = \frac{1}{n}\sum_{i=1}^n X_i$, let $C$ be the true count and $\hat{\epsilon}$ be the estimated error computed by the paper's error equation (equ:error). As $n \to \infty$, the probability of the true relative ...

Figures (15)

  • Figure 1: Three different runs (three curves) of Arya's ELP prediction given the LiveJ graph and triangle pattern. With an error bound of 10%, the curves give dramatically different predictions of the number of samples $N_s$: 5,260, 26,510 and 121,210. This leads to a 25$\times$ performance difference in the sampling execution phase.
  • Figure 2: Sample hits and misses in Arya, on Twitter40 (top) and Friendster (bottom) graphs. The pattern is 4-clique for both. In total, $10^8$ samples are drawn in each case. Each green point is a hit sample, while each red point is a miss sample. For Twitter40, there are 7,033 hits, a hit rate of $7\times 10^{-5}$. For Friendster, there are only 5 hits, a hit rate of $5\times 10^{-8}$ (i.e., a needle in the hay).
  • Figure 3: Execution time variance of Neighbor Sampling (NS) and Graph Sparsification (GS), under the same error bounds.
  • Figure 4: The normal distribution of the means of sampled counts (i.e. our predicted counts) using neighbor sampling (NS). We ran NS to collect $10^6$ samples on LiveJ, 4-clique. We obtained a predicted count by taking the mean of a random subset of 100 of these underlying samples. We simulated 1000 of these predicted counts. Although the underlying distribution of the sampled counts (green bars) is not a normal distribution, their means (purple bars), which are our predicted counts, do follow a normal distribution (dashed red line).
  • Figure 5: Comparing the hit rate of NS-prune with NS-base, with the 4-clique pattern on various graphs.
  • ...and 10 more figures
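
The experiment described in the Figure 4 caption (take the mean of a random subset of 100 samples, repeat 1000 times, compare the shape of the means with the shape of the raw samples) can be mimicked on synthetic data. The sketch below is only an illustration under stated assumptions: an exponential distribution stands in for the skewed sampled counts, since the LiveJ samples are not available here, and skewness near zero is used as a rough proxy for "looks normal". The helper names `subset_means` and `skewness` are hypothetical.

```python
import random
import statistics


def subset_means(samples, subset_size, num_means, rng):
    """Mean of `subset_size` randomly chosen samples, repeated `num_means` times."""
    return [
        statistics.fmean(rng.sample(samples, subset_size))
        for _ in range(num_means)
    ]


def skewness(values):
    """Third standardized moment; close to 0 for a symmetric, normal-like shape."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sigma) ** 3 for v in values])


if __name__ == "__main__":
    rng = random.Random(42)
    # Assumption: exponential draws stand in for the skewed sampled counts.
    raw = [rng.expovariate(1.0) for _ in range(100_000)]
    means = subset_means(raw, subset_size=100, num_means=1000, rng=rng)
    print(f"skewness of raw samples:  {skewness(raw):.2f}")
    print(f"skewness of subset means: {skewness(means):.2f}")
```

The raw samples are strongly right-skewed while the subset means are far more symmetric, which is the behavior the caption reports for the predicted counts. The effect weakens when hits are extremely rare, because a subset of 100 samples may then contain no hits at all.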

Theorems & Definitions (2)

  • Theorem 1
  • Lemma 1