Detrimental task execution patterns in mainstream OpenMP runtimes

Adam S. Tuft, Tobias Weinzierl, Michael Klemm

TL;DR

The paper tackles detrimental task execution patterns in mainstream OpenMP runtimes by analyzing a stationary black-hole simulation with Otter-based tracing. It identifies four problematic patterns—premature task activation, lack of embedded parallelism, unfair yields, and throughput-biased waiting—that harm the critical path. To address them, the authors propose prescriptive API extensions (e.g., deferrable tasks, priority-enabled taskloop, and latency/throughput taskyield qualifiers) and outline practical realizations that rely on modest runtime changes and task-priority management. The work aims to empower developers to control task scheduling more explicitly, enabling incremental performance improvements for task-heavy HPC codes while inviting broader evaluation across runtimes and domains.

Abstract

The OpenMP API offers both task-based and data-parallel concepts to scientific computing. While it provides descriptive and prescriptive annotations, it is in many places deliberately unspecific about how these annotations are to be implemented. As the predominant OpenMP implementations share design rationales, they introduce "quasi-standards" for how certain annotations behave. By means of a task-based astrophysical simulation code, we highlight situations where this "quasi-standard" reference behaviour introduces performance flaws. Therefore, we propose prescriptive clauses to constrain the OpenMP implementations. Simulated task traces uncover the clauses' potential, while a discussion of their realization highlights that they would manifest in rather incremental changes to any OpenMP runtime supporting task priorities.

Paper Structure (14 sections, 4 figures, 2 algorithms)

Figures (4)

  • Figure 1: The Otter tool suite and Otter's trace-simulate-postprocess workflow.
  • Figure 2: Created vs. consumed enclave tasks on four threads (7623, 7629, 7630, 7631) over time for a single timestep. Threads 7623, 7629 and 7631 produce tasks which are held in a task queue and then completed, i.e. consumed, once the producing task has terminated. Thread 7630 produces so many tasks that the producer task is suspended, as further child tasks cannot be deferred; they are executed immediately. From 222 s onwards, 7629 and 7631 start to process tasks and eventually steal from 7630, which allows 7630 to stop interrupting the producer. All data are recorded, i.e., not simulated.
  • Figure 3: Trace for the task execution pattern from Figure 2, with grey bars illustrating the traversal tasks (task producers), dots showing the creation of enclave child tasks, and narrow black bars denoting the actual enclave task execution. Bars embedded in the traversal task on thread 7630 show the task being suspended when the thread immediately executes an enclave task. All data are recorded, i.e., not simulated. The trace illustrates that the traversal task of thread 7630 is on the critical path. Not deferring child tasks prolongs this path.
  • Figure 4: Timeline of a single time step. The red bars highlight where the critical task runs into some embedded parallel for constructs. Recorded timings augmented by postprocessing data (critical path).