Approximate Butterfly Counting in Sublinear Time

Chi Luo, Jiaxin Song, Yuhao Zhang, Kai Wang, Zhixing He, Kuan Yang

Abstract

Bipartite graphs serve as a natural model for representing relationships between two different types of entities. When analyzing bipartite graphs, butterfly counting is a fundamental research problem that aims to count the number of butterflies (i.e., 2x2 bicliques) in a given bipartite graph. While this problem has been extensively studied in the literature, existing algorithms usually necessitate access to a large portion of the entire graph, presenting challenges in real scenarios where graphs are extremely large and I/O costs are expensive. In this paper, we study the butterfly counting problem under the query model, where the following query operations are permitted: degree query, neighbor query, and vertex-pair query. We propose TLS, a practical two-level sampling algorithm that can estimate the butterfly count accurately while accessing only a limited graph structure, achieving significantly lower query costs under the standard query model. TLS also incorporates several key techniques to control the variance, including "small-degree-first sampling" and "wedge sampling via small subsets". To ensure theoretical guarantees, we further introduce two novel techniques: "heavy-light partition" and "guess-and-prove", integrated into TLS. With these techniques, we prove that the algorithm can achieve a (1+eps) accuracy for any given approximation parameter 0 < eps < 1 on general bipartite graphs with a promised time and query complexity. In particular, the promised time is sublinear when the input graph is dense enough. Extensive experiments on 15 datasets demonstrate that TLS delivers robust estimates with up to three orders of magnitude lower query costs and runtime compared to existing solutions.
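
To make the query model concrete, here is a minimal sketch of an oracle that exposes the three permitted operations (degree query, neighbor query, vertex-pair query) and counts how often they are called. The class name `QueryOracle`, its method names, and the brute-force counter `exact_butterflies` are hypothetical, introduced only for illustration. This is an exact baseline and not the TLS algorithm: it shows what a butterfly count is and how query cost could be measured.

```python
from itertools import combinations


class QueryOracle:
    """Hypothetical query-model wrapper; every call costs one query."""

    def __init__(self, edges):
        self._adj = {}
        self._edges = set()
        for u, v in edges:
            self._adj.setdefault(u, []).append(v)
            self._adj.setdefault(v, []).append(u)
            self._edges.add(frozenset((u, v)))
        self.queries = 0

    def degree(self, x):
        """Degree query: number of neighbors of x."""
        self.queries += 1
        return len(self._adj.get(x, []))

    def neighbor(self, x, i):
        """Neighbor query: the i-th neighbor of x (0-indexed)."""
        self.queries += 1
        return self._adj[x][i]

    def has_edge(self, x, y):
        """Vertex-pair query: is (x, y) an edge?"""
        self.queries += 1
        return frozenset((x, y)) in self._edges


def exact_butterflies(oracle, left):
    """Count 2x2 bicliques exactly: each pair of left vertices with
    c common neighbors contributes c * (c - 1) / 2 butterflies."""
    total = 0
    for a, b in combinations(left, 2):
        common = 0
        for i in range(oracle.degree(a)):
            if oracle.has_edge(b, oracle.neighbor(a, i)):
                common += 1
        total += common * (common - 1) // 2
    return total


# Complete bipartite graph with 2 left and 3 right vertices.
left = ["u0", "u1"]
right = ["v0", "v1", "v2"]
oracle = QueryOracle([(u, v) for u in left for v in right])
count = exact_butterflies(oracle, left)
print(count, oracle.queries)
```

The exact counter touches every pair of left vertices, so its query cost grows with the whole graph; a sampling estimator in this model would aim to issue far fewer calls to the same three methods.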

Paper Structure (17 sections, 15 theorems, 18 equations, 6 figures, 3 tables, 6 algorithms)

Key Result

Lemma 1

The peak memory usage of the ${\tt ESpar}$ algorithm is $O(p\cdot|E| + |V|)$.
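
This page does not define ${\tt ESpar}$. The sketch below assumes it denotes edge sparsification: keep each edge independently with probability $p$, count butterflies exactly on the sample, and scale by $1/p^4$ because a butterfly survives only if all four of its edges do. Under that assumption the $p\cdot|E|$ term corresponds to the expected number of stored edges and the $|V|$ term to per-vertex working space. The function names are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
import random
from collections import defaultdict


def exact_butterflies(edges):
    """Exact count on a simple bipartite graph given as (left, right) pairs."""
    left_adj = defaultdict(list)
    right_adj = defaultdict(list)
    for u, v in edges:
        left_adj[u].append(v)
        right_adj[v].append(u)
    total = 0
    for u, nbrs in left_adj.items():
        # At most one counter per left vertex, so O(|V|) working space.
        wedges = defaultdict(int)
        for v in nbrs:
            for w in right_adj[v]:
                if w > u:  # count each left pair (u, w) once
                    wedges[w] += 1
        total += sum(c * (c - 1) // 2 for c in wedges.values())
    return total


def espar_estimate(edges, p, seed=None):
    """Assumed edge-sparsification estimator: sample edges, count, rescale."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]  # about p * |E| edges
    return exact_butterflies(kept) / p ** 4


# Complete bipartite graph with 3 vertices per side (9 butterflies).
edges = [(u, v) for u in range(3) for v in range(3)]
print(exact_butterflies(edges))
print(espar_estimate(edges, 0.5, seed=1))
```

With `p = 1.0` every edge is kept and the estimate equals the exact count; smaller `p` trades accuracy for memory.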

Figures (6)

  • Figure 1: A bipartite graph instance.
  • Figure 2: A bipartite graph containing high-degree vertices $u_0$, $u_1$, $v_{1000}$, and $v_{1001}$.
  • Figure 3: Overall comparison of different metrics.
  • Figure 4: Relative errors under fixed time/query.
  • Figure 5: Time and query cost of obtaining 3% relative error for varying graph density.
  • ...and 1 more figure

Theorems & Definitions (29)

  • Definition 1: Wedge
  • Definition 2: Butterfly
  • Lemma 1: Peak Memory Usage of the ${\tt ESpar}$ Algorithm
  • proof
  • Lemma 2: Peak Memory Usage of the ${\tt WPS}$ Algorithm
  • proof
  • Lemma 3: Efficiency of the ${\tt TLS}$ Algorithm
  • proof
  • Lemma 4: Peak Memory Usage of the ${\tt TLS}$ Algorithm
  • proof
  • ...and 19 more