Rhizomes and Diffusions for Processing Highly Skewed Graphs on Fine-Grain Message-Driven Systems

Bibrak Qamar Chandio, Prateek Srivastava, Maciej Brodowicz, Martin Swany, Thomas Sterling

TL;DR

The paper targets the latent, fine-grained parallelism in highly skewed graphs by co-designing three pieces: a diffusive, asynchronous programming model; an action-based language that spawns work where the data resides; and a vertex-centric data structure (Rhizomes) that distributes both out-degree and in-degree workloads across many cores. The approach hinges on RPVOs, which form recursive hierarchies for vertex data and ghost nodes, augmented by rhizome-links that share the in-degree load. A coherent global view of each vertex is preserved through event-driven synchronization (rhizome-collapse), with Local Control Objects (LCOs) providing fine-grained coordination. Using a CCA-inspired simulator, the authors report performance gains for BFS, SSSP, and PageRank on graphs with highly skewed degree distributions, and show how diffusive pruning and throttling mitigate network congestion while preserving compute occupancy. The work is aimed at near-memory, PGAS-oriented architectures: it supports dynamic graph processing and dynamic task creation through actions, and it executes asynchronously without global barriers. The reported results compare topology variants and existing approaches in terms of speedup and energy, and point to the potential of rhizome-based vertex abstractions for real-time, dynamic graph workloads.

Abstract

The paper provides a unified co-design of 1) a programming and execution model that allows spawning tasks from within the vertex data at runtime, 2) language constructs for actions that send work to where the data resides, combined with the parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives, and 3) an innovative vertex-centric data structure, based on the concept of Rhizomes, that parallelizes both the out-degree and in-degree load of vertex objects across many cores while still presenting a single programming abstraction for each vertex. The data structure parallelizes the out-degree load of vertices hierarchically and the in-degree load laterally. The rhizomes communicate internally and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex. Simulated experimental results show performance gains for BFS, SSSP, and Page Rank on large chip sizes for the tested input graph datasets, which have highly skewed degree distributions. The improvements come from three sources: the ability to express and create fine-grain dynamic computing tasks in the form of actions, language constructs that help the compiler generate code that the runtime system uses to schedule tasks optimally, and a data structure that shares both in-degree and out-degree compute workload among memory-processing elements.
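The two directions of parallelization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VertexObject`, `RhizomeVertex`, `EDGE_CAP`, and `GHOST_FANOUT`, the round-robin spreading of out-edges, and the modulo policy for in-edges are all assumptions chosen for illustration. Out-edges that overflow an object's local capacity are pushed down into ghost children (hierarchical), while in-edges are spread across several root objects of the same vertex (lateral).

```python
from dataclasses import dataclass, field
from typing import List

EDGE_CAP = 2      # hypothetical local out-edge capacity per object
GHOST_FANOUT = 2  # hypothetical number of ghost children per object


@dataclass
class VertexObject:
    """One piece of a vertex: a rhizome root or a ghost vertex."""
    vertex_id: int
    edges: List[int] = field(default_factory=list)
    ghosts: List["VertexObject"] = field(default_factory=list)
    _next_ghost: int = 0

    def insert_edge(self, dst: int) -> None:
        # Keep the edge locally while there is room; otherwise push it
        # down the hierarchy, cycling over the ghost children.
        if len(self.edges) < EDGE_CAP:
            self.edges.append(dst)
            return
        if len(self.ghosts) < GHOST_FANOUT:
            self.ghosts.append(VertexObject(self.vertex_id))
        child = self.ghosts[self._next_ghost % len(self.ghosts)]
        self._next_ghost += 1
        child.insert_edge(dst)

    def out_edges(self) -> List[int]:
        # Gather the edges held locally and in every ghost below.
        result = list(self.edges)
        for ghost in self.ghosts:
            result.extend(ghost.out_edges())
        return result

    def max_local_edges(self) -> int:
        # Largest number of edges held by any single object in this subtree.
        return max([len(self.edges)] + [g.max_local_edges() for g in self.ghosts])


@dataclass
class RhizomeVertex:
    """A single logical vertex backed by several rhizome roots."""
    vertex_id: int
    num_rhizomes: int
    roots: List[VertexObject] = field(init=False)
    _next_out: int = 0

    def __post_init__(self) -> None:
        self.roots = [VertexObject(self.vertex_id) for _ in range(self.num_rhizomes)]

    def root_for_in_edge(self, src: int) -> int:
        # Assumed policy: each in-edge points at one root, chosen by modulo.
        return src % self.num_rhizomes

    def insert_out_edge(self, dst: int) -> None:
        # Assumed policy: out-edges are spread round-robin over the roots.
        self.roots[self._next_out % self.num_rhizomes].insert_edge(dst)
        self._next_out += 1

    def all_out_edges(self) -> List[int]:
        return [e for root in self.roots for e in root.out_edges()]
```

In this sketch no single object ever holds more than `EDGE_CAP` out-edges, and the in-edges of a high-degree vertex arrive at `num_rhizomes` different objects instead of one, which is the load-sharing idea the abstract describes.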

Paper Structure (19 sections, 2 equations, 10 figures, 1 table)

This paper contains 19 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: An AM-CCA chip with many processing elements.
  • Figure 2: Vertex-Centric Data Structures: (a) a simple vertex with all its out-edges stored in a list and pointed to by all its in-edges, (b) the same vertex with only its out-edges partitioned hierarchically, (c) the same vertex with its out-edges partitioned hierarchically and its in-edges partitioned rhizomatically.
  • Figure 3: score : (AND Float), an AND Gate LCO of Float type, used as an exemplar. The figure shows the internal state of the AND Gate LCO object (one per RPVO) as it provides rhizome consistency for the Page Rank score of a single vertex. RPVOs send their score over the rhizome-link and receive the scores of other RPVOs over the rhizome-link. Once the rhizome has collapsed, the associated action is triggered locally at each RPVO and the score AND Gate is reset.
  • Figure 4: Vertex object allocation policy: (a) Localize ghost vertices in Compute Cells nearby, (b) No regard to locality of ghost vertices, (c) Disperse rhizomes to far away Compute Cells using random allocator while keeping ghost vertices localized using vicinity allocator.
  • Figure 5: A snapshot taken during the application run, showing the status of each compute cell. There are $128 \times 128$ compute cells with a per-virtual-channel buffer size of $4$, solving BFS on the RMAT-$18$ graph.
  • ...and 5 more figures
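
The rhizome-collapse behaviour described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's language constructs: the class `AndGateLCO`, its `set` method, and the helper `rhizome_collapse` are hypothetical names, and the sketch assumes each RPVO contributes its own partial score to its own gate as well as to the gates of the other RPVOs.

```python
from typing import Callable, List, Optional


class AndGateLCO:
    """Fires an action once all expected contributions have arrived, then resets."""

    def __init__(self, expected: int, on_trigger: Callable[[float], None]) -> None:
        self.expected = expected      # number of contributions to wait for
        self.on_trigger = on_trigger  # action run locally when the gate fires
        self.count = 0
        self.value = 0.0

    def set(self, contribution: float) -> None:
        # Accumulate one contribution; fire when the last one arrives.
        self.value += contribution
        self.count += 1
        if self.count == self.expected:
            total = self.value
            self.count, self.value = 0, 0.0  # reset the gate for the next round
            self.on_trigger(total)


def rhizome_collapse(partials: List[float]):
    """Each RPVO sends its partial score to every RPVO of the vertex (assumed
    to include itself); every gate then fires with the same combined score."""
    n = len(partials)
    results: List[Optional[float]] = [None] * n

    def make_action(i: int) -> Callable[[float], None]:
        def action(total: float) -> None:
            results[i] = total  # stands in for the locally triggered action
        return action

    gates = [AndGateLCO(n, make_action(i)) for i in range(n)]
    for score in partials:
        for gate in gates:  # models a send over the rhizome-link
            gate.set(score)
    return results, gates
```

For example, `rhizome_collapse([0.25, 0.5, 0.125])` leaves every RPVO with the same combined score and every gate reset, without any global barrier: each gate fires as soon as its own last contribution arrives.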