SoK: The Faults in our Graph Benchmarks

Puneet Mehrotra; Vaastav Anand; Daniel Margo; Milad Rezaei Hajidehi; Margo Seltzer

TL;DR

This SoK examines pervasive weaknesses in graph benchmarking: dataset idiosyncrasies that evaluations ignore, reliance on synthetic generators that produce unrealistic graphs, and inconsistent reporting that undermines cross-study comparison. It combines a 12-year literature review with a quantitative study of how vertex orderings and zero-degree vertices affect performance and correctness across multiple systems. The study finds substantial, sometimes ordering-dependent, variance: performance changes by as much as 38% on several popular systems, and BFS appears much faster when it starts from a zero-degree vertex. The authors propose a concrete set of best practices, including standardized benchmark suites, richer metrics, careful reporting of preprocessing, and the use of diverse real and synthetic datasets, to improve reproducibility and to give developers better guidance. Together, these contributions argue for principled benchmarking that better reflects real-world workloads and evolving graph data, with the goal of advancing reliable and comparable graph-processing technology.

Abstract

Graph-structured data is prevalent in domains such as social networks, financial transactions, brain networks, and protein interactions. As a result, the research community has produced new databases and analytics engines to process such data. Unfortunately, there is not yet widespread benchmark standardization in graph processing, and the heterogeneity of evaluations found in the literature can lead researchers astray. Evaluations frequently ignore datasets' statistical idiosyncrasies, which significantly affect system performance. Scalability studies often use datasets that fit easily in memory on a modest desktop. Some studies rely on synthetic graph generators, but these generators produce graphs with unnatural characteristics that also affect performance, producing misleading results. Currently, the community has no consistent and principled way to compare systems and to guide developers who wish to select the system best suited to their application. We provide three different systematizations of benchmarking practices. First, we present a 12-year literature review of graph processing benchmarking, including a summary of the prevalence of specific datasets and benchmarks used in these papers. Second, we demonstrate the impact of two statistical properties of datasets that drastically affect benchmark performance. We show how different assignments of IDs to vertices, called vertex orderings, dramatically alter benchmark performance due to the caching behavior they induce. We also show the impact of zero-degree vertices on the runtime of benchmarks such as breadth-first search and single-source shortest path. We show that these issues can cause performance to change by as much as 38% on several popular graph processing systems. Finally, we suggest best practices to account for these issues when evaluating graph systems.
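The two dataset properties named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' experimental code. The toy graph, the `relabel` and `bfs_visited` helpers, and the random seed are all hypothetical and chosen only for illustration. The sketch shows that a vertex ordering changes only the IDs (and hence the memory layout), not the graph structure, and that a BFS started from a zero-degree vertex terminates after visiting a single vertex.

```python
import random
from collections import deque


def bfs_visited(adj, source):
    """Return the number of vertices reached by BFS from `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)


def relabel(adj, perm):
    """Apply a vertex ordering: old ID u becomes new ID perm[u]."""
    new_adj = [[] for _ in adj]
    for u, nbrs in enumerate(adj):
        new_adj[perm[u]] = sorted(perm[v] for v in nbrs)
    return new_adj


# Hypothetical toy graph (undirected adjacency lists); vertex 4 has degree zero.
adj = [[1, 2], [0, 3], [0, 3], [1, 2], []]

# A random vertex ordering (assumption: fixed seed, for repeatability only).
rng = random.Random(0)
perm = list(range(len(adj)))
rng.shuffle(perm)
shuffled = relabel(adj, perm)

# Same structure under both orderings: BFS reaches the same number of vertices.
reach_original = bfs_visited(adj, 0)
reach_shuffled = bfs_visited(shuffled, perm[0])

# A zero-degree source makes BFS trivial: only the source itself is visited.
reach_isolated = bfs_visited(adj, 4)

print(reach_original, reach_shuffled, reach_isolated)
```

On a graph this small the ordering has no measurable effect on runtime. The cache effects the abstract describes would only appear on graphs much larger than the cache, so the sketch demonstrates the relabeling mechanics rather than the performance gap.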

Paper Structure (26 sections, 1 equation, 11 figures, 5 tables)

Introduction
12 Years of Graph System Benchmarking
Methodology
Selecting Papers
Findings
Datasets
Synthetic Graph Generators
Benchmarks
Co-occurrence of datasets and benchmarks
Conference Conformance Bias
Artifact evaluation of graph processing systems
Quantitative Study
Effect of Vertex Orderings
Effect on Benchmark Performance
Effect on Benchmark Correctness
...and 11 more sections

Figures (11)

Figure 1: Our procedure to identify contemporary graph system papers. (1) Our corpus consists of papers published at 10 conferences held between 2011 and 2023, pruned using common-sense graph terms. (2) We filter the corpus with major graph system names to obtain candidate papers. (3) We read the papers and extract more graph system terms from them. (4) We update the filter with the new terms and repeat the process.
Figure 2: Frequency of usage of datasets and benchmarks that appear in our 227-paper corpus. The red axis shows the number of papers that use a particular dataset or benchmark. The blue axis shows the CDF of each dataset's or benchmark's share of the total usage of all datasets or benchmarks across our paper corpus.
Figure 3: The frequency of co-usage of the top-10 benchmarks and top-10 datasets
Figure 4: Partitionability of graphs generated by Smooth Kronecker compared to graphs generated by Noisy Kronecker generators
Figure 5: (a) shows the use of the top-10 datasets across conferences; (b) shows the use of the top-10 benchmarks across conferences. We omit PODS as there is no paper from PODS in our corpus.
...and 6 more figures
