A New Execution Model and Executor for Adaptively Optimizing the Performance of Parallel Algorithms Using HPX Runtime System

Karame Mohammadiporshokooh, Steven R. Brandt, Hartmut Kaiser

TL;DR

This paper develops a runtime-adaptive execution model for HPX that optimizes core usage and chunking in parallel algorithms, addressing the overheads that arise from static resource allocation. By modeling execution with $T_0$ (overhead) and $T_1$ (sequential work) and deriving an Overhead Law, it computes optimal core counts and chunk sizes that adapt in real time based on measured loop times. The model is implemented as the adaptive_core_chunk_size execution parameter. The implementation integrates with HPX through customization points (measure_iteration, processing_units_count, get_chunk_size) and execution policies, so it can be used via the .on() and .with() interfaces. Across diverse workloads and architectures, the adaptive executor yields consistent speedups over static configurations, with larger gains on compute-bound tasks, and it does so without increasing algorithmic complexity. The work suggests a path toward broadly applicable, runtime-driven optimization in parallel runtimes while maintaining a familiar C++ executors API.
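To make the idea of deriving a core count and chunk size from $T_0$ and $T_1$ concrete, here is a minimal, self-contained C++ sketch. It does not use HPX and it is not the paper's Overhead Law, which this summary does not reproduce. It assumes a simple illustrative model, $T_N = T_1/N + T_0$, and picks the largest core count whose parallel efficiency stays above a threshold. The names `ExecutionPlan`, `plan_execution`, `min_efficiency`, and `chunks_per_core` are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical result type for this sketch (not an HPX type).
struct ExecutionPlan {
    std::size_t cores;       // number of cores to use
    std::size_t chunk_size;  // iterations per chunk
};

// Illustrative planner under an ASSUMED model, not the paper's exact law:
//   T_N = T_1 / N + T_0            (time on N cores)
//   E   = T_1 / (N * T_N)          (parallel efficiency)
//       = T_1 / (T_1 + N * T_0)
// Requiring E >= min_efficiency gives
//   N <= T_1 * (1 - min_efficiency) / (min_efficiency * T_0).
//
// Preconditions: t0 > 0, max_cores >= 1, chunks_per_core >= 1,
//                0 < min_efficiency < 1.
inline ExecutionPlan plan_execution(std::size_t iterations,
                                    double t_iteration,  // measured seconds per iteration
                                    double t0,           // assumed overhead in seconds
                                    std::size_t max_cores,
                                    double min_efficiency = 0.9,
                                    std::size_t chunks_per_core = 4)
{
    // Sequential work T_1 estimated from the measured per-iteration time.
    const double t1 = t_iteration * static_cast<double>(iterations);

    // Largest core count that keeps efficiency at or above the threshold.
    const double limit =
        t1 * (1.0 - min_efficiency) / (min_efficiency * t0);

    // Clamp to [1, max_cores] while still in floating point, so a very
    // large limit cannot overflow the integer conversion.
    const double clamped = std::max(
        1.0, std::min(std::floor(limit), static_cast<double>(max_cores)));
    const std::size_t cores = static_cast<std::size_t>(clamped);

    // Split the range into chunks_per_core chunks for each core in use,
    // with at least one iteration per chunk.
    const std::size_t chunk_size = std::max<std::size_t>(
        1, iterations / (cores * chunks_per_core));

    return {cores, chunk_size};
}
```

Under these assumptions, 1000 iterations at a measured 1 µs each with an assumed 10 µs overhead give 11 cores and a chunk size of 22, while a loop of only 10 iterations falls back to a single core. A real adaptive executor would re-measure the iteration time at runtime rather than take it as a fixed argument.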

Abstract

Developing parallel algorithms efficiently requires careful management of concurrency across diverse hardware architectures. C++ executors provide a standardized interface that simplifies development, allowing developers to write portable and uniform code. However, in some cases they may not fully leverage hardware capabilities or optimally allocate resources for specific workloads, leading to performance inefficiencies. Our earlier conference paper [Adaptively Optimizing the Performance of HPX's Parallel Algorithms] introduced a preliminary strategy, based on cores and chunking (workload) and integrated into HPX's executor API, that dynamically optimizes workload distribution and resource allocation using runtime metrics and overheads. Building on that work, this paper introduces a more detailed model of the strategy. It evaluates the efficiency of the implementation (as an HPX executor) across a wide range of compute-bound and memory-bound workloads, on different architectures and with different algorithms. The results show consistent speedups across all tests, configurations, and workloads studied, offering improved performance through a familiar and user-friendly C++ executors API. Additionally, the paper highlights how runtime-driven executor adaptation can simplify performance optimization without increasing the complexity of algorithm development.

Paper Structure

This paper contains 16 sections, 11 equations, 7 figures.

Figures (7)

  • Figure 1: Array size vs. speedup when using different numbers of processing units (cores) for parallelizing the finite-difference algorithm for different numbers of chunks-per-core, $C$. For comparison, the value of $C$ in these runs behaves like the chunk size argument to OpenMP's static scheduling algorithm.
  • Figure 2: Speedup measured for the adjacent difference algorithm across a range of core counts and input sizes. Executions with different fixed numbers of cores are compared with the results measured when using the new adaptive_core_chunk_size (acc) (red line).
  • Figure 3: Speedup across various core counts for a compute-bound use case when using the default static parameters compared to using the new adaptive_core_chunk_size (acc) (red line) across varying input sizes on Intel hardware.
  • Figure 4: Speedup across various core counts for a compute-bound use case when using the default static parameters compared to using the new adaptive_core_chunk_size (acc) (red line) across varying input sizes on AMD hardware.
  • Figure 5: Speedup across various core counts for a compute-bound use case when using the default static parameters compared to using the new adaptive_core_chunk_size (acc) (red line) across varying input sizes on RISC-V hardware. (New)
  • ...and 2 more figures