An Asynchronous Many-Task Algorithm for Unstructured $S_{N}$ Transport on Shared Memory Systems
Alex Elwood, Tom Deakin, Justin Lovegrove, Chris Nelson
TL;DR
This work tackles the scalability of $S_N$ transport on unstructured meshes. It first audits the Bulk Synchronous Parallel (BSP) implementation in UnSNAP on modern many-core CPUs and GPUs, identifying synchronization and cache-dependency bottlenecks. It then introduces an Asynchronous Many-Task (AMT) parallelization for the shared-memory domain, which uses recursive work-first task generation with per-element atomic counters to schedule downwind work, and distributed runtimes for task scheduling. Across multiple hardware platforms, the AMT approach achieves notable speedups over BSP, particularly for lower-order finite elements, and reduces synchronization overhead while maintaining or improving cache utilization. The results indicate that AMT can significantly improve utilization on high-core-count CPUs, and they motivate extending the AMT strategy to GPUs and to additional distributed-memory schemes. Overall, the AMT method improves the scalability of unstructured $S_N$ transport solvers and offers a practical path toward efficient, fine-grained parallelism on contemporary architectures.
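The counter-based scheduling mentioned above can be illustrated with a minimal serial sketch. It is not the UnSNAP implementation: the function name `sweep_order`, the `downwind` adjacency list, and the 2x2 example mesh are all assumptions made for illustration. Each element holds a counter of unsolved upwind neighbours. Solving an element decrements the counters of its downwind neighbours, and any neighbour whose counter reaches zero is solved immediately, which mimics work-first task generation. In a real shared-memory code the decrement would be an atomic operation and the recursive call would spawn a task.

```python
def sweep_order(downwind):
    """Return the order in which elements are solved for one sweep direction.

    downwind[e] lists the elements that depend on element e (hypothetical
    adjacency-list layout, assumed to describe an acyclic dependency graph).
    """
    n = len(downwind)

    # counters[e] = number of upwind neighbours of e that are still unsolved.
    counters = [0] * n
    for targets in downwind:
        for d in targets:
            counters[d] += 1

    # Elements with no upwind dependencies seed the sweep.
    roots = [e for e in range(n) if counters[e] == 0]
    order = []

    def solve(e):
        order.append(e)  # stand-in for the local transport solve on element e
        for d in downwind[e]:
            counters[d] -= 1  # would be an atomic decrement in parallel code
            if counters[d] == 0:
                # Work-first: run the newly ready element straight away.
                # A task runtime would spawn a task here instead.
                solve(d)

    for r in roots:
        solve(r)
    return order


# 2x2 mesh swept from element 0: 0 feeds 1 and 2, which both feed 3.
example_mesh = [[1, 2], [3], [3], []]
print(sweep_order(example_mesh))
```

Element 3 is only solved after both of its upwind neighbours have decremented its counter, so no global barrier between sweep stages is needed. The recursion depth grows with the longest dependency chain, so a production version would not rely on Python recursion for large meshes.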
Abstract
Discrete ordinates $S_N$ transport solvers on unstructured meshes are challenging to scale due to complex data dependencies, irregular memory access patterns and a high-dimensional domain. In this paper, we review the performance bottlenecks within the shared memory parallelization scheme of an existing transport solver on modern many-core architectures. Informed by this analysis, we survey the performance of this solver across a variety of compute hardware. We then present a new Asynchronous Many-Task (AMT) algorithm for shared memory parallelism, show results demonstrating an increase in computational performance over the existing method, and evaluate why performance is improved.
