Counting Butterflies over Streaming Bipartite Graphs with Duplicate Edges

Lingkai Meng, Long Yuan, Xuemin Lin, Chengjie Li, Kai Wang, Wenjie Zhang

TL;DR

This work tackles butterfly counting in streaming bipartite graphs with duplicate edges by introducing DEABC, a bucket-based priority sampling method that achieves memory-efficient, unbiased estimates with low variance. DEABC replaces the heavy priority-queue approach of prior methods with a fixed set of $M$ buckets and a Flajolet–Martin–style sketch to estimate the number of distinct edges, enabling accurate correction of butterfly counts without tracking duplicates explicitly. The authors prove unbiasedness and derive variance and error bounds, and demonstrate that DEABC outperforms state-of-the-art baselines in accuracy, throughput, and memory usage on real-world datasets with duplicates. The approach has practical impact for real-time graph analytics in domains where streaming bipartite data and duplicates are common, providing reliable butterfly-based measures under tight memory constraints.
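The duplicate-insensitivity that bucket-based sampling relies on can be illustrated with a minimal sketch. This is not the DEABC algorithm itself: the helper names `edge_hash` and `bucket_sample`, the choice of BLAKE2b as the hash, and the rule of keeping the lowest-priority edge per bucket are all assumptions made for illustration. The point is only that when both the bucket and the priority are derived from a hash of the edge, repeated occurrences of an edge cannot change the sample.

```python
import hashlib


def edge_hash(u, v):
    """Deterministic 64-bit hash of an edge (hypothetical helper).

    Every duplicate of (u, v) maps to the same value.
    """
    digest = hashlib.blake2b(f"{u}|{v}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bucket_sample(stream, num_buckets):
    """Keep, per bucket, the edge with the smallest hash-derived priority.

    Memory is bounded by num_buckets slots, with no priority queue.
    """
    buckets = [None] * num_buckets  # each slot holds (priority, edge) or None
    for u, v in stream:
        h = edge_hash(u, v)
        idx = h % num_buckets        # bucket chosen by the hash
        priority = h // num_buckets  # remaining bits act as the priority
        if buckets[idx] is None or priority < buckets[idx][0]:
            buckets[idx] = (priority, (u, v))
    return buckets


# A toy stream in which two edges appear twice.
stream = [
    ("u1", "v1"), ("u2", "v1"), ("u1", "v1"),
    ("u1", "v2"), ("u2", "v1"), ("u2", "v2"),
]
with_dups = bucket_sample(stream, num_buckets=4)
without_dups = bucket_sample(list(dict.fromkeys(stream)), num_buckets=4)
```

Here `with_dups` and `without_dups` are identical, because a duplicate edge always lands in the same bucket with the same priority and therefore never displaces anything.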

Abstract

Bipartite graphs are commonly used to model relationships between two distinct types of entities in real-world applications, such as user-product interactions, user-movie ratings, and collaborations between authors and publications. A butterfly (a 2x2 biclique) is a critical substructure in bipartite graphs, playing a significant role in tasks like community detection, fraud detection, and link prediction. As more real-world data arrives in streaming form, efficiently counting butterflies in streaming bipartite graphs has become increasingly important. However, most existing algorithms assume that duplicate edges are absent, an assumption that rarely holds in real-world graph streams. As a result, they tend to sample edges that appear multiple times, leading to inaccurate results. The only algorithm designed to handle duplicate edges is FABLE, but it suffers from significant limitations, including high variance, substantial time complexity, and memory inefficiency due to its reliance on a priority queue. To overcome these limitations, we introduce DEABC (Duplicate-Edge-Aware Butterfly Counting), a method that uses bucket-based priority sampling to accurately estimate the number of butterflies while accounting for duplicate edges. Compared to existing methods, DEABC significantly reduces memory usage by storing only the essential sampled edge data while maintaining high accuracy. We provide rigorous proofs of DEABC's unbiasedness and variance bounds, which underpin its accuracy. We compare DEABC with state-of-the-art algorithms on real-world streaming bipartite graphs. The results show that DEABC outperforms existing methods in memory efficiency and accuracy, while also achieving significantly higher throughput.
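To make the quantity being estimated concrete, the following is a minimal exact (non-streaming) butterfly counter over distinct edges. It is a reference sketch, not the paper's method; the function name `count_butterflies` and the toy edge list are assumptions for illustration. It counts, for every pair of left-side vertices, their number of common right-side neighbors $c$, and adds $\binom{c}{2}$ butterflies for that pair.

```python
from collections import defaultdict
from itertools import combinations


def count_butterflies(edges):
    """Exact number of butterflies (2x2 bicliques) among the distinct edges."""
    # Storing neighbors in sets discards duplicate edges.
    right_to_left = defaultdict(set)
    for u, v in edges:
        right_to_left[v].add(u)

    # common[(a, b)] = number of right vertices adjacent to both a and b.
    common = defaultdict(int)
    for lefts in right_to_left.values():
        for a, b in combinations(sorted(lefts), 2):
            common[(a, b)] += 1

    # Any two shared right neighbors of a left pair form one butterfly.
    return sum(c * (c - 1) // 2 for c in common.values())


# Complete bipartite graph K_{2,3}: C(2,2) * C(3,2) = 3 butterflies.
edges = [(u, v) for u in ("u1", "u2") for v in ("v1", "v2", "v3")]
exact = count_butterflies(edges)

# Replaying some edges must not change the count.
exact_with_dups = count_butterflies(edges + edges[:4])
```

Both `exact` and `exact_with_dups` equal 3. A counter that treated each repeated edge as new would instead overcount, which is the failure mode the abstract attributes to duplicate-unaware methods.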

Paper Structure

This paper contains 13 sections, 7 theorems, 27 equations, 32 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

The FABLE algorithm takes $O\left(m^{(t)} + \left(M + M \cdot \ln\frac{m_d^{(t)}+1}{M}\right) \cdot \left(\log M + d_{\max}^2\right)\right)$ time to process $t$ elements in the input bipartite graph stream, where $M$ is the maximum number of edges in the sample, $d_{\max}$ is the maximum degree of any vertex in the sampled subgraph [...]

Figures (32)

  • Figure 1: Butterflies in A Bipartite Graph
  • Figure 2: Twitter-ut (edges)
  • Figure 3: Edit-frwiki (edges)
  • Figure 4: AmazonRatings (edges)
  • Figure 5: Movie-lens (edges)
  • ...and 27 more figures

Theorems & Definitions (16)

  • Definition 1: Butterfly
  • Definition 2: Bipartite Graph Stream
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • ...and 6 more