EdgeSketch: Efficient Analysis of Massive Graph Streams

Jakub Lemiesz, Dingqi Yang, Philippe Cudré-Mauroux

TL;DR

EdgeSketch is a compact, universal graph representation built in a single pass over an edge stream. It provides unbiased estimators for fundamental graph statistics and lets graph algorithms run directly on the sketch. It combines ideas from ExpSketch, FastExpSketch, and NodeSketch to store per-node sketches (arrays F and S) that support set operations and estimates of degrees, edge counts, and subgraph queries, for directed and undirected graphs as well as hypergraphs. The paper demonstrates two core applications, community detection via a sketch-based Louvain method and graph reconstruction, and reports substantial memory savings and runtime improvements with controlled accuracy, underpinned by analytical variance bounds that decay as $O(1/m)$ in the sketch size $m$. The approach scales to massive graphs (e.g., a stochastic block model (SBM) graph with $|E| \approx 2.85\times 10^8$ and real-world bipartite-to-unipartite graphs with billions of edges), which supports streaming analytics when the full graph cannot be accessed and lets several analysis tasks share one compact sketch.
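The per-node F and S arrays can be pictured with a minimal ExpSketch-style sketch in Python. This is an illustration under stated assumptions, not the paper's algorithm: the class name `SketchStore`, the helper `unit_hash`, the use of SHA-256 as the hash, the choice to store the neighbor identifier in F, and the estimator $(m-1)/\sum_k S_k$ are all introduced here for illustration.

```python
import hashlib
import math
from collections import defaultdict


def unit_hash(edge_id: str, k: int) -> float:
    """Deterministic pseudo-uniform value strictly inside (0, 1).

    Hypothetical helper: SHA-256 is an assumption, any good hash would do.
    """
    digest = hashlib.sha256(f"{edge_id}|{k}".encode()).digest()
    x = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (x + 0.5) / 2**52


class SketchStore:
    """Hypothetical per-node sketch store with arrays S (minima) and F (argmins)."""

    def __init__(self, m: int):
        if m < 2:
            raise ValueError("m must be at least 2")
        self.m = m
        self.S = defaultdict(lambda: [math.inf] * m)
        self.F = defaultdict(lambda: [None] * m)

    def add_edge(self, u, v, w: float = 1.0) -> None:
        """Process one undirected edge (u, v) with weight w from the stream."""
        a, b = sorted((str(u), str(v)))
        edge_id = f"{a}|{b}"  # same id for (u, v) and (v, u)
        for k in range(self.m):
            # Hash-derived exponential variable with rate w for position k.
            t = -math.log(unit_hash(edge_id, k)) / w
            for node, other in ((u, v), (v, u)):
                if t < self.S[node][k]:
                    self.S[node][k] = t      # running minimum
                    self.F[node][k] = other  # who achieved the minimum

    def weighted_degree(self, node) -> float:
        """ExpSketch-style estimate of the total weight of edges at `node`."""
        s = self.S.get(node)
        if s is None:
            return 0.0
        return (self.m - 1) / sum(s)
```

Because each variable is derived from a hash of the edge, re-inserting an edge that is already in the stream leaves the sketch unchanged, so the estimate reflects distinct edges. Memory is fixed at $2m$ values per node regardless of how many edges arrive.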

Abstract

We introduce EdgeSketch, a compact graph representation for efficient analysis of massive graph streams. EdgeSketch provides unbiased estimators for key graph properties with controllable variance and supports implementing graph algorithms on the stored summary directly. It is constructed in a fully streaming manner, requiring a single pass over the edge stream, while offline analysis relies solely on the sketch. We evaluate the proposed approach on two representative applications: community detection via the Louvain method and graph reconstruction through node similarity estimation. Experiments demonstrate substantial memory savings and runtime improvements over both lossless representations and prior sketching approaches, while maintaining reliable accuracy.
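To make "unbiased estimators with controllable variance" concrete, the following is the standard calculation for an ExpSketch-style estimator of a total weight $W$. It is a worked example under the assumption that each of the $m$ sketch positions holds an independent $\mathrm{Exp}(W)$ minimum; the symbols $W$, $G$, and $\hat{W}$ are introduced here and the paper's own estimators may differ in form.

$$S_1, \dots, S_m \overset{\text{iid}}{\sim} \mathrm{Exp}(W), \qquad G = \sum_{k=1}^{m} S_k \sim \mathrm{Gamma}(m, W)$$

$$\mathbb{E}\!\left[\frac{1}{G}\right] = \frac{W}{m-1}, \qquad \mathbb{E}\!\left[\frac{1}{G^2}\right] = \frac{W^2}{(m-1)(m-2)} \qquad (m > 2)$$

$$\hat{W} = \frac{m-1}{G}, \qquad \mathbb{E}[\hat{W}] = W, \qquad \mathrm{Var}[\hat{W}] = \frac{(m-1)\,W^2}{m-2} - W^2 = \frac{W^2}{m-2}$$

The relative variance is therefore $1/(m-2) = O(1/m)$, so doubling the sketch size roughly halves the variance.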

Paper Structure (31 sections, 1 theorem, 41 equations, 2 figures, 3 tables, 4 algorithms)


Key Result

Lemma 1

Fix any position $k$ in the sketch, and let $E$ be a set of edges with associated weights $\{w_e\}_{e \in E}$. Let $T_e \sim \mathrm{Exp}(w_e)$ be the random variable generated for each edge $e$ on line 9 of Algorithm alg:UpdateES. Then $S_{C,k} = \min_{e \in E} T_e$ and $F_{C,k} = \arg\min$ ...
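The lemma statement is cut off above, so the following does not restate it. It is a small Monte Carlo check of two standard facts about independent exponential variables that a min/argmin sketch position relies on: the minimum of $T_e \sim \mathrm{Exp}(w_e)$ is distributed as $\mathrm{Exp}(\sum_e w_e)$, and edge $e$ attains the minimum with probability $w_e / \sum_{e'} w_{e'}$. The function name `simulate_position` and the example weights are hypothetical.

```python
import random


def simulate_position(weights, trials=20000, seed=0):
    """Simulate one sketch position many times.

    weights: dict mapping an edge label to its positive weight w_e.
    Returns (mean of the minimum, empirical argmin frequency per edge).
    """
    rng = random.Random(seed)
    total_min = 0.0
    wins = {e: 0 for e in weights}
    for _ in range(trials):
        # Draw T_e ~ Exp(w_e) independently for every edge.
        draws = {e: rng.expovariate(w) for e, w in weights.items()}
        winner = min(draws, key=draws.get)  # argmin over edges
        total_min += draws[winner]          # min over edges
        wins[winner] += 1
    mean_min = total_min / trials
    freqs = {e: c / trials for e, c in wins.items()}
    return mean_min, freqs
```

With weights 1, 2, and 3, theory gives a mean minimum of $1/6$ and argmin probabilities $1/6$, $1/3$, and $1/2$; the simulated values should land close to these.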

Figures (2)

  • Figure 1: Experimental results on graph $\mathbb{H}$. In Figures (a)--(c), the reference value $mod_{\mathbb{H}}(\mathcal{P}_{AL})$ is denoted by a dashed line, while EdgeSketch-based estimates are computed for sketch sizes $m$ ranging from 20 to 300 in steps of 20. For each $m$, we run 50 trials, reporting the mean (solid line) and standard deviation (shaded region). Figure (d) depicts the dynamic setting: edges arrive in 60 batches of $6 \times 10^4$ randomly selected edges each, with three modularity approximations computed after every batch ($m=100$).
  • Figure 2: Edge reconstruction precision $P_t$ on graph $\mathbb{H}$ for three sketches under equal memory budgets, with embedding depth $k=4$ and decay parameter $\alpha=0.2$.

Theorems & Definitions (2)

  • Lemma 1
  • Proof