A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs

Aleix Boné; Alejandro Aguirre; David Álvarez; Pedro J. Martinez-Ferrer; Vicenç Beltran

TL;DR

This work proposes reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.

Abstract

Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some cases these are combined with optimized vendor math libraries such as cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms, so combining them within a single application is error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API. Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime. Our results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.
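To make the idea of a DAG of tasks connected by data dependencies concrete, the following is a minimal sketch using only standard OpenMP `task` constructs with `depend` clauses. It is not the authors' code and does not use the TACUDA or TASYCL interfaces: the functions `scale_kernel`, `shift_kernel`, and `run_pipeline`, as well as the constant `MAX_N`, are hypothetical names, and the two "kernels" are plain host loops standing in for work that would be launched through two different accelerator APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_N 1024 /* hypothetical upper bound for this sketch */

/* Host stand-in for a kernel launched through one accelerator API. */
static void scale_kernel(const double *in, double *out, size_t n, double factor)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = factor * in[i];
}

/* Host stand-in for a kernel launched through a second accelerator API. */
static void shift_kernel(const double *in, double *out, size_t n, double offset)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + offset;
}

/* Builds a four-task DAG: init -> {scale, shift} -> reduce. */
double run_pipeline(size_t n)
{
    assert(n > 0 && n <= MAX_N);

    double a[MAX_N], b[MAX_N], c[MAX_N];
    double sum = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer task: writes a. */
        #pragma omp task depend(out: a)
        for (size_t i = 0; i < n; ++i)
            a[i] = (double)i;

        /* These two tasks only read a, so the runtime may run them concurrently. */
        #pragma omp task depend(in: a) depend(out: b)
        scale_kernel(a, b, n, 2.0);

        #pragma omp task depend(in: a) depend(out: c)
        shift_kernel(a, c, n, 1.0);

        /* Consumer task: starts only after both b and c have been produced. */
        #pragma omp task depend(in: b, c) depend(out: sum)
        for (size_t i = 0; i < n; ++i)
            sum += b[i] + c[i];

        /* Wait for the whole graph before leaving the region. */
        #pragma omp taskwait
    }
    return sum;
}
```

Calling `run_pipeline(8)` executes the graph once and returns the reduced value. Built with an OpenMP-enabled compiler (for example with `-fopenmp` on GCC or Clang), the runtime orders the tasks from the `depend` clauses alone; without OpenMP the pragmas are ignored and the same code runs sequentially. In the methodology described above, the bodies of the two middle tasks would instead issue asynchronous device work, with a task-aware library tying task completion to completion of that work.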

Paper Structure (26 sections, 8 figures)

Introduction
Background
Vendor APIs and libraries
Portable APIs and libraries
OmpSs-2 and OpenMP task-based data-flow model
Task-aware libraries
GPT-2 pre-training
GPT-2 fork-join parallelisation
GPT-2 task parallelization
HPCCG benchmark
Related work
Automatic data management and directory/cache mechanisms
Portable programming models
Interoperability across programming models
Task-aware libraries for heterogeneous computing
...and 11 more sections

Figures (8)

Figure 1: Modified architecture of the GPT-2 model used in this work.
Figure 2: Application trace of Code \ref{code:cuda-openmp-nota} (a) without and (b) with task-awareness. Dashed lines represent task dependencies, while dotted lines represent device-host synchronization.
Figure 3: Integration of task-aware libraries within a task-based data-flow programming model.
Figure 4: GPU evaluation of GPT-2. We compare monolithic and task-based executions ($N=4$ tasks) for four kernel programming models (OpenMP offload, SYCL, Triton, and "Mixed", a combination of the other three), both with and without cuBLAS. Each bar reports the top three kernels by duration for a single iteration of the model (context length of 256 tokens and 1.5 B parameters) executed on a single H100 GPU. Bars are stacked to show the time contribution of each kernel.
Figure 5: GPU breakdown for the HPCCG application with $256^{3}$ points executed on a single H100 GPU. Each bar reports the three most time-consuming kernels, stacked to show their contribution to a full iteration.
...and 3 more figures
