Scalable and Consistent Graph Neural Networks for Distributed Mesh-based Data-driven Modeling

Shivam Barwey, Riccardo Balin, Bethany Lusch, Saumil Patel, Ramesh Balakrishnan, Pinaki Pal, Romit Maulik, Venkatram Vishwanath

TL;DR

The impact of consistency on the scalability of mesh-based GNNs is studied, demonstrating efficient scaling in consistent GNNs for up to O(1B) graph nodes on the Frontier exascale supercomputer.

Abstract

This work develops a distributed graph neural network (GNN) methodology for mesh-based modeling applications using a consistent neural message passing layer. As the name implies, the focus is on enabling scalable operations that satisfy physical consistency via halo nodes at sub-graph boundaries. Here, consistency refers to the fact that a GNN trained and evaluated on one rank (one large graph) is arithmetically equivalent to evaluations on multiple ranks (a partitioned graph). This concept is demonstrated by interfacing GNNs with NekRS, a GPU-capable exascale CFD solver developed at Argonne National Laboratory. It is shown how the NekRS mesh partitioning can be linked to the distributed GNN training and inference routines, resulting in a scalable mesh-based data-driven modeling workflow. We study the impact of consistency on the scalability of mesh-based GNNs, demonstrating efficient scaling in consistent GNNs for up to O(1B) graph nodes on the Frontier exascale supercomputer.

Paper Structure

This paper contains 10 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of workflow, with components from existing tools highlighted in red and contributions of this work highlighted in blue. Code for the NekRS interface and consistent GNN implementation is openly available at the following GitHub repository: https://github.com/argonne-lcf/nekRS-ML/tree/GNN.
  • Figure 2: Illustration of element-based discretizations and graph generation. Left plots show elements at increasing polynomial orders per the Gauss-Legendre-Lobatto (GLL) quadrature of NekRS, with black markers denoting spatial quadrature points and blue lines denoting element boundaries. Right plots show the corresponding graph representations produced by taking quadrature points as nodes (black markers) and generating edges (black lines) to connect neighboring nodes.
  • Figure 3: (a) Full $R=1$ graph of mesh composed of 8 total elements, each characterized by polynomial order $p=5$ (refer to Fig. 2). Blue markers indicate local coincident nodes and element boundaries -- all other nodes are not shown for ease of visualization. (b) Corresponding distributed $R=2$ graph, highlighting the construction of non-local coincident nodes (red markers). Arrows between sub-graphs indicate communication directions required to enforce consistency at these nodes. (c) Reduced distributed graph produced by collapsing/consolidating local coincident nodes. Block between (b) and (c) illustrates the node collapse procedure between two neighboring elements on a local graph in 1D.
  • Figure 4: (Top) Schematic of halo nodes for an $R=2$ distributed graph consisting of two $p=1$ elements. (Bottom) Visualization of node attribute matrices involved in the exchange.
  • Figure 5: Visualization of consistent NMP layer steps on a small $R=4$ distributed graph composed of four $p=1$ elements for illustrative purposes. 2D projections are shown for ease of visualization. Node colors are the same as in the Fig. 4 legend. (a) Illustration of edge update (blue arrows, Eq. \ref{eq:cmp_edge_update}) and local edge aggregation (purple arrows, Eq. \ref{eq:cmp_edge_aggr}). (b) Halo exchange over the local edge aggregates between neighboring ranks, populating the respective halo nodes (Eq. \ref{eq:cmp_halo_swap}). For visual clarity, swap directions (blue arrows) are shown only for exchanges involving Rank 1. (c) Synchronization step of the exchanged aggregates (Eq. \ref{eq:cmp_sync}). Summations of the aggregated features ${\bf a}_r^i$ occur locally among coincident-halo node pairs sharing the same global index (indicated by regions in blue boxes).
  • ...and 3 more figures
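
The four steps named in the Figure 5 caption (edge update, local edge aggregation, halo exchange, synchronization) can be illustrated with a minimal single-process NumPy sketch. It is not the authors' implementation: the function names, the `tanh` stand-in for a learned edge update, and the dictionary layout of each rank are hypothetical, and the halo exchange is emulated with array indexing rather than real inter-rank communication. The sketch also assumes that every edge belongs to exactly one rank, which holds for the two 1D $p=1$ elements used here. Under those assumptions, it shows what the abstract means by consistency: aggregates computed on the partitioned graph match those of the unpartitioned graph.

```python
import numpy as np

def edge_messages(x, edges):
    """Hypothetical stand-in for a learned edge update: one message per directed edge."""
    src, dst = edges[:, 0], edges[:, 1]
    return np.tanh(x[src] - x[dst])

def local_aggregate(x, edges):
    """Sum incoming edge messages at each node of one (sub-)graph."""
    agg = np.zeros_like(x)
    np.add.at(agg, edges[:, 1], edge_messages(x, edges))
    return agg

def consistent_aggregate(ranks):
    """Emulate halo exchange and synchronization across sub-graphs.

    ranks: list of dicts with 'gid' (global node ids), 'x' (node features)
    and 'edges' (directed edges in local node indices).
    """
    local = [local_aggregate(r["x"], r["edges"]) for r in ranks]
    synced = [a.copy() for a in local]
    for i, ri in enumerate(ranks):
        for j, rj in enumerate(ranks):
            if i == j:
                continue
            # Halo exchange: find nodes held by both ranks via their global ids.
            _, idx_i, idx_j = np.intersect1d(ri["gid"], rj["gid"], return_indices=True)
            # Synchronization: add the neighbor's local aggregate to our own.
            synced[i][idx_i] += local[j][idx_j]
    return synced

def two_element_demo():
    """Two 1D p=1 elements (global nodes 0-1 and 1-2), one element per rank."""
    x_full = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 3.0]])
    full_edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1]])
    full = local_aggregate(x_full, full_edges)  # single-rank reference

    elem_edges = np.array([[0, 1], [1, 0]])  # local indices within one element
    ranks = [
        {"gid": np.array([0, 1]), "x": x_full[[0, 1]], "edges": elem_edges},
        {"gid": np.array([1, 2]), "x": x_full[[1, 2]], "edges": elem_edges},
    ]
    return full, consistent_aggregate(ranks)

if __name__ == "__main__":
    full, synced = two_element_demo()
    print(np.allclose(synced[0], full[[0, 1]]), np.allclose(synced[1], full[[1, 2]]))
```

Without the synchronization step, each rank's copy of global node 1 would hold only the message from its own element, so the two ranks would disagree with each other and with the single-rank result. In a real distributed setting the indexing inside `consistent_aggregate` would be replaced by point-to-point or collective communication between ranks.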