Fast and Accurate Triangle Counting in Graph Streams Using Predictions

Cristian Boldrin, Fabio Vandin

TL;DR

This work tackles fast and accurate triangle counting in large graph streams under fixed memory. It introduces Tonic, a one-pass algorithm that blends Waiting Room Sampling, Reservoir Sampling, and a heaviness predictor to focus on edges most involved in triangles, with a simple MinDegreePredictor provided as a practical option. The authors prove unbiasedness of the estimates, analyze time and space complexity, and show variance reductions over state-of-the-art methods when the predictor is informative, with robustness to adversarial predictors. Empirical results on real, large-scale graphs demonstrate that Tonic outperforms prior approaches in accuracy and speed, especially across sequences of hundreds of streams and in fully dynamic settings, making it practical for deployment in streaming graph analytics contexts.
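Of the ingredients named above, reservoir sampling is the standard technique for keeping a uniform sample of a stream within a fixed budget of $k$ items. The following is a minimal sketch of the textbook version (Algorithm R), not of Tonic itself; the function name `reservoir_sample` and its parameters are illustrative assumptions and do not come from the paper.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream.

    Textbook reservoir sampling (Algorithm R). This is NOT the Tonic
    algorithm; it only illustrates the fixed-memory sampling idea.
    """
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            # The first k items always fit in the budget.
            reservoir.append(item)
        else:
            # Item t replaces a uniformly chosen slot with probability k / t.
            j = rng.randrange(t)
            if j < k:
                reservoir[j] = item
    return reservoir
```

After processing $t \ge k$ items, each item seen so far is in the reservoir with probability $k/t$, which is what allows sampled counts to be rescaled into unbiased estimates.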

Abstract

In this work, we present the first efficient and practical algorithm for estimating the number of triangles in a graph stream using predictions. Our algorithm combines waiting room sampling and reservoir sampling with a predictor for the heaviness of edges, that is, the number of triangles in which an edge is involved. As a result, our algorithm is fast, provides guarantees on the amount of memory used, and exploits the additional information provided by the predictor to produce highly accurate estimates. We also propose a simple and domain-independent predictor, based on the degree of nodes, that can be easily computed with one pass on a stream of edges when the stream is available beforehand. Our analytical results show that, when the predictor provides useful information on the heaviness of edges, it leads to estimates with reduced variance compared to the state-of-the-art, even when the predictions are far from perfect. Our experimental results show that, when analyzing a single graph stream, our algorithm is faster than the state-of-the-art for a given memory budget, while providing significantly more accurate estimates. Even more interestingly, when sequences of hundreds of graph streams are analyzed, our algorithm significantly outperforms the state-of-the-art using our simple degree-based predictor built by analyzing only the first graph of the sequence.
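The degree-based predictor described above can be sketched as follows. The abstract only states that the predictor is based on node degrees and is computable in one pass; scoring an edge by the smaller of its two endpoint degrees is an assumption suggested by the name MinDegreePredictor, and the function names here are hypothetical. The sketch also assumes the stream contains no duplicate edges.

```python
from collections import defaultdict


def build_min_degree_predictor(edges):
    """One pass over an edge stream; returns a heaviness-score function.

    Assumption (not confirmed by the source text): an edge (u, v) is
    scored by min(deg(u), deg(v)). Assumes no duplicate edges.
    """
    degree = defaultdict(int)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops, which cannot be part of a triangle
        degree[u] += 1
        degree[v] += 1

    def predict(u, v):
        # Nodes never seen while building the predictor get degree 0.
        return min(degree.get(u, 0), degree.get(v, 0))

    return predict
```

Under these assumptions the predictor stores one counter per node rather than one entry per edge, so it stays small and can be built on the first graph of a sequence and reused on later ones.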

Paper Structure (38 sections, 8 theorems, 22 equations, 14 figures, 4 tables, 5 algorithms)

Key Result

Theorem 4.1

Let $T^{(t)}$ and $T_u^{(t)}$ be the true global count of triangles in the graph and the true local triangle count for node $u \in V$ at time $t$, respectively. We have:

Figures (14)

  • Figure 1: Error vs fraction of memory budget used for waiting room and/or heavy edges. All algorithms are provided with memory budget $k = m/10$. For each combination of algorithm and parameter (including predictor for Tonic), the average and 95% confidence interval over 50 repetitions are shown. The chosen configuration for Tonic and the configurations suggested by WRS and Chen publications are highlighted.
  • Figure 2: Error (left) and runtime (right) vs memory budget. Each left subplot reports the size of OracleExact and Oracle-noWR as number of edges, and the size of MinDegreePredictor as number of nodes. Chen runtimes are not shown for clarity, since they are 4 to 15 times larger than the Tonic runtime. For each combination of algorithm and parameter (including predictor for Tonic), the average and standard deviation over 50 repetitions are shown. The algorithm parameters are as in the legend (for WRS and Chen they are fixed as in the respective publications; for Tonic they are as chosen in Fig. 1).
  • Figure 3: (Left) Distribution of estimates for Patents dataset. (Center-left) Estimation error as time progresses on Actors dataset. (Center-right) Number and type of triangles counted by each algorithm on three datasets. (Right) Fraction of each type of triangles in the total estimates by each algorithm on three datasets.
  • Figure 4: Error on snapshot networks with sequences of graph streams. The bottom plots are for the first 400 streams (left) and the remaining streams (right) of AS-733. In all cases the predictors are trained only on the first graph stream of the sequence (results on that graph stream are not shown). For each combination of algorithm and parameter (including predictor for Tonic), the average and standard deviation over 50 repetitions are shown. The algorithm parameters are as in the legend (for WRS and Chen they are fixed as in the respective publications; for Tonic they are as chosen in Fig. 1).
  • Figure 5: Estimation error as time progresses during the Oregon fully dynamic (FD) stream. For each combination of algorithm and parameter (including predictor for Tonic-FD), the average and standard deviation over 50 repetitions are shown. The algorithm parameters are as in the legend (for WRS they are fixed as in the respective publication; for Tonic-FD they are as chosen in Fig. 1). $\bar{n}$: number of unique nodes; $\bar{m}$: number of unique edges; $m_{max}$: maximum number of edges at some time; $m$: total number of edges; $T_m$: number of global triangles at the end, derived from the FD stream.
  • ...and 9 more figures

Theorems & Definitions (14)

  • Theorem 4.1
  • Theorem 4.2
  • Proposition 1
  • Proposition 2
  • Lemma D.1
  • proof
  • Lemma D.2
  • proof
  • Theorem D.3
  • proof
  • ...and 4 more