Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Gabin Schieffer; Ruimin Shi; Stefano Markidis; Andreas Herten; Jennifer Faj; Ivy Peng

TL;DR

In a single-node setup with four GPUs, direct peer-to-peer memory accesses between GPUs and the RCCL library both outperform MPI-based solutions in memory/communication latency and bandwidth.

Abstract

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a base for validating memory and communication strategies on a system and improving applications on AMD multi-GPU computing systems.

Paper Structure (19 sections, 12 figures, 2 tables)

Introduction
Background
Infinity Fabric Interconnect
HIP Programming Model
Memory Management
Testing Methodology
CPU-GPU Communication
Peak Achievable Bandwidth
GPU-Aware Memory Placement
Multi-GPU Bandwidth
Point-to-Point GPU Communication
Explicit Peer-to-Peer Data Movements
Latency
Bandwidth
Direct Memory Access
...and 4 more sections

Figures (12)

Figure 1: Overview of a multi-GPU compute node, totaling eight GCDs, distributed onto four physical MI250X GPUs, coupled with a single-socket AMD 3rd generation EPYC CPU. Adapted from frontier-user.
Figure 2: Peak achieved host-to-device bandwidth in our experiments, for direct GPU access to CPU memory with unified memory, and explicit data movements with hipMemcpy.
Figure 3: Host-to-device memory bandwidth at increased data transfer sizes, measured with Comm|Scope. The maximum for each interface is indicated in boxes.
Figure 4: Total bidirectional CPU-GPU bandwidth, measured using STREAM copy kernels running in parallel on one or two GCDs. For the dual-GCD cases, the two GCDs are either located on a single physical GPU (same GPU), or on two distinct physical GPUs (spread). The achieved percentage of theoretical bandwidth is presented.
Figure 5: Total bidirectional CPU-GPU bandwidth, measured using STREAM copy kernels running in parallel on one to eight GCDs. The achieved percentage of theoretical bandwidth is presented.
...and 7 more figures
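
The captions for Figures 4 and 5 report bandwidth from STREAM copy kernels as a percentage of theoretical bandwidth. The arithmetic behind such a figure can be sketched as below. This is a minimal illustration, not the authors' measurement code: the function names are hypothetical, the 2 × N × element-size byte count follows the usual STREAM copy convention (one read and one write per element), and all numeric inputs are placeholders rather than values from the paper.

```python
def stream_copy_bandwidth_gbs(num_elements, bytes_per_element, seconds):
    """Return the bandwidth of a copy kernel in GB/s (1 GB = 1e9 bytes).

    Assumption: a copy kernel reads one array and writes another, so it
    moves 2 * num_elements * bytes_per_element bytes in total.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bytes_moved = 2 * num_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


def percent_of_peak(achieved_gbs, theoretical_gbs):
    """Return achieved bandwidth as a percentage of the theoretical peak."""
    if theoretical_gbs <= 0:
        raise ValueError("theoretical_gbs must be positive")
    return 100.0 * achieved_gbs / theoretical_gbs


if __name__ == "__main__":
    # Placeholder inputs, NOT measurements from the paper.
    achieved = stream_copy_bandwidth_gbs(
        num_elements=1 << 27,   # hypothetical array length
        bytes_per_element=8,    # double-precision elements
        seconds=0.05,           # hypothetical kernel time
    )
    hypothetical_peak_gbs = 72.0  # hypothetical link peak, for illustration only
    print(f"achieved: {achieved:.2f} GB/s")
    print(f"percent of peak: {percent_of_peak(achieved, hypothetical_peak_gbs):.1f}%")
```

When several GCDs run the kernel concurrently, the per-GCD byte counts would be summed before dividing by the common elapsed time, and the result compared against the aggregate theoretical bandwidth of the links involved.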
