Near Uniform Triangle Sampling Over Adjacency List Graph Streams
Arijit Bishnu, Arijit Ghosh, Gopinath Mishra, Sayantan Sen
TL;DR
The paper addresses near-uniform triangle sampling from graph streams: the goal is to output a triangle drawn from a distribution within $\varepsilon$ of uniform over all triangles, measured in $\ell_1$ distance. It introduces a heavy-light edge framework and develops both multi-pass and one-pass algorithms in the Adjacency List model. Their space matches the corresponding triangle-counting bounds, namely $\tilde{\Theta}(m/T^{2/3})$ for 3 passes and $\tilde{\Theta}(m/\sqrt{T})$ for 1 pass, where $m$ is the number of edges and $T$ the number of triangles. It also extends the sampling results to the Edge Arrival (Ea) and Vertex Arrival (Va) models, with $\tilde{O}(\min\{m, m^{2}/T\})$ and $\tilde{O}(m^{3/2}/T)$ bounds, respectively, along with lower bounds. The key techniques are charging triangles to edges, distinguishing heavy from light edges and triangles, and running parallel subroutines (SampLightHelper and SampHeavyHelper) to enable single-pass sampling. The results are relevant to streaming subgraph queries and sampling with guarantees, with potential applications in databases, social networks, and biology, and they leave open the problem of sampling other substructures in other streaming models.
Abstract
Triangle counting and sampling are two fundamental problems for streaming algorithms. Arguably, designing sampling algorithms is more challenging than designing their counting variants, yet triangle counting has received far greater attention in the literature. In this work, we consider the problem of approximately sampling triangles in different streaming models, with a focus on the adjacency list model. In this problem, the edges of a graph $G$ arrive over a data stream. The goal is to design efficient streaming algorithms that sample and output a triangle from a distribution over the triangles in $G$ that is close to the uniform distribution over those triangles, where the distance between distributions is measured in $\ell_1$-distance. The main technical contribution of this paper is the design of algorithms for this triangle sampling problem in the adjacency list model whose space complexities match those of their counting variants. For the sake of completeness, we also show results for the vertex and edge arrival models.
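
To make the sampling objective concrete, the following is a minimal offline sketch, not the paper's streaming algorithm. It stores the whole graph, enumerates its triangles, and measures the $\ell_1$ distance between a given output distribution and the uniform distribution over those triangles. The helper names `triangles` and `l1_from_uniform` are hypothetical and introduced only for illustration; the sketch assumes a simple undirected graph with at least one triangle and a distribution supported on its triangles.

```python
def triangles(edges):
    """Enumerate all triangles of a simple undirected graph.

    Offline baseline that keeps every edge in memory; it only
    illustrates the sample space, not a sublinear-space algorithm.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        # Every common neighbour w of u and v closes the triangle {u, v, w}.
        for w in adj[u] & adj[v]:
            found.add(tuple(sorted((u, v, w))))
    return sorted(found)


def l1_from_uniform(dist, tris):
    """l1 distance between `dist` (triangle -> probability) and uniform on `tris`.

    Assumes `dist` puts no mass outside `tris` and that `tris` is non-empty.
    """
    p = 1.0 / len(tris)
    return sum(abs(dist.get(t, 0.0) - p) for t in tris)


# Complete graph on 4 vertices: it has exactly 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
tris = triangles(k4)
uniform = {t: 1.0 / len(tris) for t in tris}
point_mass = {tris[0]: 1.0}
```

Under this measure, a sampler meets the stated objective when `l1_from_uniform` of its output distribution is at most $\varepsilon$. In the example, `uniform` is at distance 0, while `point_mass` is at distance $|1 - 1/4| + 3 \cdot 1/4 = 1.5$.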
