Covering a Graph with Dense Subgraph Families, via Triangle-Rich Sets
Sabyasachi Basu, Daniel Paul-Pena, Kun Qian, C. Seshadhri, Edward W Huang, Karthik Subbian
TL;DR
This paper introduces regularly triangle-rich (RTR) sets to formalize the goal of covering large graphs with many dense subgraphs. It presents RTRExtractor, a provably approximate algorithm that outputs a disjoint family of $\Omega(1)$-RTR sets and guarantees that a constant fraction of any $\alpha$-RTR set is captured, with running time tied to triangle enumeration. The authors provide a practical, fast implementation and demonstrate high coverage on real networks, often achieving dense clusters that cover a substantial portion of the graph (e.g., a quarter of vertices in dense subgraphs on large social graphs). Empirically, RTRExtractor yields many large, dense subgraphs whose vertices group into meaningful, label-free communities, highlighting its utility for unsupervised graph discovery. The work also compares RTRExtractor against established dense-subgraph and community-detection methods, showing superior high-density coverage and scalability, and suggests directions for future work including hierarchical organization of outputs and further sparsification techniques.
Abstract
Graphs are a fundamental data structure used to represent relationships in domains as diverse as the social sciences, bioinformatics, cybersecurity, the Internet, and more. One of the central observations in network science is that real-world graphs are globally sparse, yet contains numerous "pockets" of high edge density. A fundamental task in graph mining is to discover these dense subgraphs. Most common formulations of the problem involve finding a single (or a few) "optimally" dense subsets. But in most real applications, one does not care for the optimality. Instead, we want to find a large collection of dense subsets that covers a significant fraction of the input graph. We give a mathematical formulation of this problem, using a new definition of regularly triangle-rich (RTR) families. These families capture the notion of dense subgraphs that contain many triangles and have degrees comparable to the subgraph size. We design a provable algorithm, RTRExtractor, that can discover RTR families that approximately cover any RTR set. The algorithm is efficient and is inspired by recent results that use triangle counts for community testing and clustering. We show that RTRExtractor has excellent behavior on a large variety of real-world datasets. It is able to process graphs with hundreds of millions of edges within minutes. Across many datasets, RTRExtractor achieves high coverage using high edge density datasets. For example, the output covers a quarter of the vertices with subgraphs of edge density more than (say) $0.5$, for datasets with 10M+ edges. We show an example of how the output of RTRExtractor correlates with meaningful sets of similar vertices in a citation network, demonstrating the utility of RTRExtractor for unsupervised graph discovery tasks.
