Optimizing ML Concurrent Computation and Communication with GPU DMA Engines

Anirudha Agrawal, Shaizeen Aga, Suchita Pati, Mahzabeen Islam

TL;DR

This work analyzes concurrent computation and communication (C3) on GPUs, revealing substantial interference that limits realized speedups to about 21% of the ideal. It first characterizes C3 through GEMM and ML collectives on MI300X, then improves performance via schedule prioritization and compute-unit partitioning to reach roughly 42% of ideal. To further close the gap, it introduces ConCCL, DMA-based all-gather and all-to-all collectives, which raise realized speedups to about 72% of ideal (up to 1.67x). The results argue for advancing GPU DMA engines and DMA-based collectives as a practical path to significantly accelerate C3 in large-scale ML workloads.

Abstract

Concurrent computation and communication (C3) is a pervasive paradigm in ML and other domains, making its performance optimization crucial. In this paper, we carefully characterize C3 in ML on GPUs, which are the most widely deployed platform for ML training and inference. We observe that while C3 leads to performance uplifts, the uplifts are far lower than ideal speedups (serial computation and communication time versus the maximum of computation and communication time; all times from isolated executions). That is, C3 on average achieves only 21% of ideal speedup. This is due to the known challenges of compute and memory interference between concurrent GPU kernels (that is, sharing of the GPU's compute units, caches, and HBM). To attain better performance for C3, we first evaluate the dual strategies of schedule prioritization and careful partitioning of compute units on GPUs, which raise C3 performance to 42% of ideal speedup on average. We also provide heuristics that can guide a runtime in employing these strategies. To further enhance C3 performance, we propose mitigating C3 interference by offloading communication tasks to the GPU's DMA engines. To this end, we build proof-of-concept concurrent communication collectives (ConCCL) that harness DMA engines for communication. We show that ConCCL considerably closes the gap between realized and ideal speedup for C3 (on average, 72% of ideal speedup is realized, with up to 1.67x speedup). Overall, our work makes a strong case for GPU DMA engine advancements to better support C3 on GPUs.
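
The ideal-speedup definition in the abstract can be made concrete with a small sketch. The function and parameter names below (`ideal_speedup`, `realized_speedup`, `t_comp`, `t_comm`, `t_c3`) are illustrative assumptions, not names from the paper, and the sketch does not attempt to reproduce how the paper normalizes its "percent of ideal" figures.

```python
def ideal_speedup(t_comp: float, t_comm: float) -> float:
    """Ideal C3 speedup: serial time over the longer of the two phases.

    Both times are assumed to come from isolated (non-concurrent) runs,
    as the abstract specifies.
    """
    if t_comp <= 0 or t_comm <= 0:
        raise ValueError("times must be positive")
    return (t_comp + t_comm) / max(t_comp, t_comm)


def realized_speedup(t_comp: float, t_comm: float, t_c3: float) -> float:
    """Speedup actually observed when the two phases run concurrently.

    t_c3 is the measured wall-clock time of the concurrent execution
    (hypothetical parameter name).
    """
    if t_c3 <= 0:
        raise ValueError("t_c3 must be positive")
    return (t_comp + t_comm) / t_c3
```

For example, when computation and communication take equal time in isolation, the ideal speedup is 2x, which is the upper bound for overlapping two phases. If interference stretches the concurrent run beyond the longer phase, the realized speedup falls below that bound.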

Paper Structure

This paper contains 40 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Baseline C3 (left) and C3 with ConCCL via DMA offloads (right).
  • Figure 2: State-of-the-art AMD Instinct™ MI300X.
  • Figure 3: Offloading a data-transfer to DMA in MI300X.
  • Figure 4: C3 taxonomy.
  • Figure 5: (a) GEMM kernel slowdown due to loss of compute units (CUs) in the GPU. (b) All-gather and (c) all-to-all kernel slowdown with a specific number of CUs assigned vs. the default number of CUs (all-gather default #CUs=64, all-to-all default #CUs=56). For a single-partition MI300X with eight XCDs, eight is the minimum number of CUs that can be assigned to a kernel.
  • ...and 5 more figures