Understanding the Communication Needs of Asynchronous Many-Task Systems -- A Case Study of HPX+LCI

Jiakun Yan, Hartmut Kaiser, Marc Snir

TL;DR

This paper analyzes the HPX AMT system to identify fundamental communication needs that differ from bulk-synchronous models. It introduces the HPX parcelport abstraction and implements a new LCI-based parcelport that maps AMT communications directly to native network primitives, featuring one-sided operations, queue-based completions, explicit progress, and resource replication. Empirical results show up to 50x improvements in microbenchmarks and up to 2x real-world application speedups over the MPI parcelport, with careful ablation studies revealing the relative value and interactions of each technique. The work provides concrete guidelines for designing AMT-friendly communication libraries and demonstrates how deeper integration with the native network layer can yield robust performance across heterogeneous extreme-scale systems.

Abstract

Asynchronous Many-Task (AMT) systems offer a potential solution for efficiently programming complicated scientific applications on extreme-scale heterogeneous architectures. However, their communication needs differ from those of traditional bulk-synchronous parallel (BSP) applications, posing new challenges for the underlying communication libraries. This work systematically studies the communication needs of AMTs and explores how communication libraries can be structured to better satisfy them through a case study of a real-world AMT system, HPX. We first examine its communication stack layout and formalize the communication abstraction that underlying communication libraries need to support. We then analyze its current MPI backend (parcelport) and identify four categories of needs that are not typical in the BSP model and are not well covered by the MPI standard. To bridge these gaps, we start our design from the native network layer and incorporate various techniques, including one-sided communication, queue-based completion notification, explicit progress, and several ways of mitigating resource contention, in a new parcelport built on an experimental communication library, LCI. Overall, the resulting LCI parcelport outperforms the existing MPI parcelport by up to 50x in microbenchmarks and 2x in a real-world application. Using it as a testbed, we design LCI parcelport variants to quantify the performance contribution of each technique. This work combines conceptual analysis and experimental results to offer a practical guideline for the future development of communication libraries and AMT communication layers.

Paper Structure

This paper contains 31 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Communication profile of the "rotating star" scenario of Octo-Tiger running on top of HPX.
  • Figure 2: AMT Communication Stack
  • Figure 3: Microbenchmark results for 8 B/16 KiB messages on Expanse: the solid lines (left axis) show the absolute message rate/latency; the dotted lines (right axis) show the relative speedup. mpi_a means MPI with aggregation.
  • Figure 4: Execution time and performance ratio of Octo-Tiger, strong scaling up to 128/256 nodes on Expanse/Frontera.
  • Figure 5: Performance comparison of Expanse (InfiniBand) and Delta (Slingshot-11) with the HPX LCI parcelport.
  • ...and 4 more figures