Rhizomes and Diffusions for Processing Highly Skewed Graphs on Fine-Grain Message-Driven Systems
Bibrak Qamar Chandio, Prateek Srivastava, Maciej Brodowicz, Martin Swany, Thomas Sterling
TL;DR
The paper tackles the challenge of exploiting latent, fine-grained parallelism in highly skewed graphs by co-designing three pieces: a diffusive, asynchronous programming model; an action-based language for spawning work where the data resides; and a novel vertex-centric data structure (Rhizomes) that distributes both out-degree and in-degree workloads across many cores. The approach builds on Recursively Parallel Vertex Objects (RPVOs), which form recursive hierarchies of vertex data and ghost nodes. Rhizome-links extend these hierarchies to share the in-degree load, while event-driven synchronization (rhizome-collapse) and Local Control Objects (LCOs) for fine-grained coordination preserve a coherent global view of each vertex. Using a CCA-inspired simulator, the authors demonstrate performance gains for BFS, SSSP, and PageRank on graphs with highly skewed degree distributions, and show how diffusive pruning and throttling mitigate network congestion while preserving compute occupancy. The work advances near-memory, PGAS-oriented (Partitioned Global Address Space) architectures by enabling dynamic graph processing, providing dynamic task creation with actions, and offering scalable, asynchronous execution without global barriers, with practical implications for future large-scale, memory-bound graph analytics. The reported results cover speedup and energy comparisons across topology variants and against existing approaches, underscoring the potential of rhizome-based vertex abstractions for real-time, dynamic graph workloads.
Abstract
The paper provides a unified co-design of 1) a programming and execution model that allows spawning tasks from within the vertex data at runtime, 2) language constructs for actions that send work to where the data resides, combined with the parallel expressiveness of Local Control Objects (LCOs) to implement asynchronous graph processing primitives, and 3) an innovative vertex-centric data structure, using the concept of Rhizomes, that parallelizes both the out-degree and in-degree load of vertex objects across many cores and yet provides a single programming abstraction to the vertex objects. The data structure parallelizes the out-degree load of vertices hierarchically and the in-degree load laterally. The rhizomes communicate internally and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex. Simulated experimental results show performance gains for BFS, SSSP, and PageRank on large chip sizes for the tested input graph datasets with highly skewed degree distributions. The improvements come from three sources: the ability to express and create fine-grain dynamic computing tasks in the form of actions; language constructs that help the compiler generate code that the runtime system uses to optimally schedule tasks; and the data structure, which shares both in-degree and out-degree compute workload among memory-processing elements.
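To make the data-structure idea in the abstract more concrete, the following is a minimal single-threaded sketch, not the authors' implementation. It assumes a vertex is split into several root objects (rhizomes) linked laterally, each with a small tree of ghost objects holding chunks of out-edges, and that a BFS "action" is a message that either updates the object it lands on and diffuses further, or is pruned. All names (`VertexObject`, `build_hierarchy`, `make_vertex`, `bfs`), the chunk size, the round-robin edge split, and the rule for choosing a target rhizome are hypothetical choices made for illustration; a plain FIFO queue stands in for the message-driven runtime.

```python
from collections import deque

INF = float("inf")

class VertexObject:
    """One piece of a vertex: a rhizome root or a ghost beneath it."""
    def __init__(self, vertex_id, out_edges):
        self.vertex_id = vertex_id
        self.out_edges = out_edges  # destination vertex ids stored locally
        self.ghosts = []            # children: hierarchical out-degree split
        self.siblings = []          # lateral links to the other rhizome roots
        self.level = INF            # BFS level as seen by this object

def build_hierarchy(vertex_id, out_edges, chunk):
    """Keep `chunk` edges locally and push the rest into up to two ghosts."""
    node = VertexObject(vertex_id, out_edges[:chunk])
    rest = out_edges[chunk:]
    half = (len(rest) + 1) // 2
    for part in (rest[:half], rest[half:]):
        if part:
            node.ghosts.append(build_hierarchy(vertex_id, part, chunk))
    return node

def make_vertex(vertex_id, out_edges, n_rhizomes, chunk=2):
    """Create n_rhizomes roots for one vertex and link them to each other."""
    roots = [build_hierarchy(vertex_id, out_edges[i::n_rhizomes], chunk)
             for i in range(n_rhizomes)]
    for root in roots:
        root.siblings = [other for other in roots if other is not root]
    return roots

def bfs(graph, source, n_rhizomes=2):
    """graph maps every vertex id to its list of destination ids."""
    vertices = {v: make_vertex(v, outs, n_rhizomes)
                for v, outs in graph.items()}
    queue = deque([(vertices[source][0], 0)])  # (target object, level)
    while queue:
        obj, level = queue.popleft()
        if level >= obj.level:
            continue  # predicate fails: the action is pruned, nothing diffuses
        obj.level = level
        # Same-level propagation: down the ghost tree and across rhizomes.
        for peer in obj.ghosts + obj.siblings:
            queue.append((peer, level))
        # Next-level diffusion: in-edges of a destination are spread over its
        # rhizomes by source id (an assumed, illustrative assignment rule).
        for dst in obj.out_edges:
            target = vertices[dst][obj.vertex_id % n_rhizomes]
            queue.append((target, level + 1))
    return {v: roots[0].level for v, roots in vertices.items()}
```

For example, `bfs({0: [1, 2, 3, 4, 5], 1: [5], 2: [], 3: [5], 4: [5], 5: [6], 6: []}, 0)` returns level 0 for vertex 0, level 1 for vertices 1 through 5, and level 2 for vertex 6. Because an action only takes effect when it lowers the stored level, the order in which messages are processed does not change the final answer, which is what lets this kind of scheme run without global barriers.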
