An Asynchronous Many-Task Algorithm for Unstructured $S_{N}$ Transport on Shared Memory Systems

Alex Elwood, Tom Deakin, Justin Lovegrove, Chris Nelson

TL;DR

This work addresses the scalability of $S_N$ transport on unstructured meshes. It first audits the Bulk Synchronous Parallel (BSP) implementation in UnSNAP on modern many-core CPUs and GPUs, identifying synchronization and cache-dependency bottlenecks. It then introduces an Asynchronous Many-Task (AMT) parallelization for the shared-memory domain, which uses recursive, work-first task generation with per-element atomic counters to schedule downwind work, and distributed runtimes for task scheduling. Across multiple hardware platforms, the AMT approach achieves speedups over BSP, particularly for lower-order finite elements, and reduces synchronization overhead while maintaining or improving cache utilization. The results indicate that AMT can significantly improve utilization on high-core-count CPUs, and they motivate extending the AMT strategy to GPUs and to additional distributed-memory schemes. Overall, the AMT method improves the scalability of unstructured $S_N$ transport solvers and offers a practical path toward efficient, fine-grained parallelism on contemporary architectures.

Abstract

Discrete ordinates $S_N$ transport solvers on unstructured meshes are challenging to scale due to complex data dependencies, memory access patterns and a high-dimensional domain. In this paper, we review the performance bottlenecks within the shared memory parallelization scheme of an existing transport solver on modern many-core architectures with high core counts. Building on this analysis, we survey the performance of this solver across a variety of compute hardware. We then present a new Asynchronous Many-Task (AMT) algorithm for shared memory parallelism, present results showing an increase in computational performance over the existing method, and evaluate why performance is improved.
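The scheduling idea summarized in the TL;DR (per-element counters that release downwind work) can be illustrated with a minimal Python sketch. This is not the UnSNAP implementation: the function name `amt_sweep`, the `downwind` adjacency-list layout, and the `solve` callback are hypothetical, a lock stands in for a hardware atomic decrement, and a loop replaces recursion. It assumes the dependency graph for one angle is acyclic. Python threads only demonstrate the scheduling logic here, not the performance behaviour.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def amt_sweep(downwind, solve, max_workers=4):
    """Sweep a dependency graph of elements for a single angle.

    downwind[e] lists the elements that depend on element e (hypothetical
    layout); solve(e) is a caller-supplied per-element kernel.
    Assumes the graph is acyclic.
    """
    n = len(downwind)
    if n == 0:
        return
    # Per-element counter: number of upwind neighbours not yet solved.
    remaining = [0] * n
    for e in range(n):
        for d in downwind[e]:
            remaining[d] += 1
    roots = [e for e in range(n) if remaining[e] == 0]
    if not roots:
        raise ValueError("no element without upwind dependencies")

    lock = threading.Lock()   # stands in for an atomic decrement
    done = threading.Event()  # set once every element has been solved
    solved = [0]
    errors = []

    def task(e):
        try:
            while e is not None:
                solve(e)
                ready = []
                with lock:
                    solved[0] += 1
                    finished = solved[0] == n
                    # Decrement each downwind counter; zero means ready.
                    for d in downwind[e]:
                        remaining[d] -= 1
                        if remaining[d] == 0:
                            ready.append(d)
                if finished:
                    done.set()
                # Keep one ready element on this thread and hand the
                # rest to the pool as new tasks.
                e = ready.pop() if ready else None
                for d in ready:
                    pool.submit(task, d)
        except BaseException as exc:
            errors.append(exc)
            done.set()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for r in roots:
            pool.submit(task, r)
        done.wait()
    if errors:
        raise errors[0]
```

For example, `amt_sweep([[1, 2], [3], [3], []], print)` sweeps a four-element graph in which element 3 waits on elements 1 and 2. Unlike a BSP sweep, no global barrier separates dependency levels: an element becomes runnable as soon as its own counter reaches zero.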

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Benchmarks for a BSP implementation of UnSNAP across a variety of compute hardware.
  • Figure 2: Diagram showing how UnSNAP parallelizes the spatial domain for an angular sweep on an example mesh (left) with Bulk Synchronous Parallelism (centre) and Asynchronous Many-Task (right).
  • Figure 3: Distributions of compute time for a single unknown in the angular flux array (grind time). Metrics taken from a $16\times 16 \times 16$ grid with one angle per octant, 16 energy groups, one inner iteration, and one outer iteration. This totals 524288 grind times in each case.
  • Figure 4: A comparison between the number of elements within each tree level and the number of processor threads/cores.
  • Figure 5: Speedups between the previous BSP method and new AMT method for UnSNAP across compute nodes from three different hardware vendors.
  • ...and 1 more figure
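
The grind-time count quoted in the Figure 3 caption can be checked by hand. The check assumes that one grind time is recorded per element, angle, and energy group, and that "one angle per octant" gives 8 angles in total:

$$
\underbrace{16 \times 16 \times 16}_{4096\ \text{elements}} \times \underbrace{8}_{\text{angles}} \times \underbrace{16}_{\text{groups}} = 524288
$$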