Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

TL;DR

In a single-node setup with four GPUs, direct peer-to-peer memory accesses between GPUs and the RCCL library outperform MPI-based solutions in memory/communication latency and bandwidth.

Abstract

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a base for validating memory and communication strategies on a system and improving applications on AMD multi-GPU computing systems.

Paper Structure (19 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Overview of a multi-GPU compute node, totaling eight GCDs, distributed onto four physical MI250X GPUs, coupled with a single-socket AMD 3rd generation EPYC CPU. Adapted from frontier-user.
  • Figure 2: Peak host-to-device bandwidth achieved in our experiments, for direct GPU access to CPU memory with unified memory and for explicit data movement with hipMemcpy.
  • Figure 3: Host-to-device memory bandwidth at increased data transfer sizes, measured with Comm|Scope. The maximum for each interface is indicated in boxes.
  • Figure 4: Total bidirectional CPU-GPU bandwidth, measured using STREAM copy kernels running in parallel on one or two GCDs. In the dual-GCD cases, the two GCDs are located either on a single physical GPU (same GPU) or on two distinct physical GPUs (spread). The achieved percentage of theoretical bandwidth is presented.
  • Figure 5: Total bidirectional CPU-GPU bandwidth, measured using STREAM copy kernels running in parallel on one to eight GCDs. The achieved percentage of theoretical bandwidth is presented.
  • ...and 7 more figures