Easy Acceleration with Distributed Arrays

Jeremy Kepner, Chansup Byun, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel Burrill, Vijay Gadepally, Ryan Haney, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Lauren Milechin, Guillermo Morales, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Peter Michaleas

TL;DR

The paper addresses the challenge of delivering scalable performance in high-level languages on CPUs and GPUs across vertical, horizontal, and temporal dimensions. It adopts distributed arrays (PGAS) and the STREAM memory-bandwidth benchmark to evaluate memory throughput across diverse hardware, using identical software stacks on the MIT SuperCloud to enable cross-era comparisons. Key findings include linear horizontal scaling across nodes, substantial memory-bandwidth gains over decades (a 10x increase per CPU core and a 100x increase per CPU node over 20 years, and a 5x increase per GPU node over 5 years), and sustained bandwidth exceeding $1\,\mathrm{PB/s}$ on hundreds of nodes. The work demonstrates that distributed arrays provide a productive abstraction for scalable HPC in high-level languages and highlights the practical impact of memory-bandwidth improvements for large-scale scientific computing.

Abstract

High-level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations of hardware) performance while retaining productivity requires effective abstractions. Distributed arrays are one such abstraction that enables high-level programming to achieve highly scalable performance. Distributed arrays achieve this performance by deriving parallelism from data locality, which naturally leads to high memory bandwidth efficiency. This paper explores distributed array performance using the STREAM memory bandwidth benchmark on a variety of hardware. Scalable performance is demonstrated within and across CPU cores, CPU nodes, and GPU nodes. Horizontal scaling across multiple nodes was linear. The hardware used spans decades and allows a direct comparison of hardware improvements for memory bandwidth over this time range, showing a 10x increase in CPU core bandwidth over 20 years, a 100x increase in CPU node bandwidth over 20 years, and a 5x increase in GPU node bandwidth over 5 years. Runs on hundreds of MIT SuperCloud nodes simultaneously achieved a sustained bandwidth of more than 1 PB/s.
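To make the benchmark concrete, here is a minimal sketch of how one process might time the STREAM triad on its local piece of three distributed vectors. It is an illustration under stated assumptions, not the paper's implementation: the function names (`local_triad`, `triad_bytes`, `measure_local_bandwidth`) are hypothetical, the code uses only the Python standard library rather than a distributed array library, and it assumes the usual STREAM convention of counting two reads and one write of 8-byte values per element.

```python
import time
from array import array


def local_triad(b, c, q):
    """STREAM triad on one process's local piece: a[i] = b[i] + q * c[i]."""
    return array("d", (bi + q * ci for bi, ci in zip(b, c)))


def triad_bytes(n_local, bytes_per_element=8):
    """Bytes moved by one triad pass (assumed convention: read b, read c, write a)."""
    return 3 * n_local * bytes_per_element


def measure_local_bandwidth(n_local, q=3.0, trials=3):
    """Time the triad a few times and report the best bandwidth in bytes/s."""
    b = array("d", [2.0]) * n_local
    c = array("d", [1.0]) * n_local
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a = local_triad(b, c, q)
        best = min(best, time.perf_counter() - t0)
    return a, triad_bytes(n_local) / best


a, bw = measure_local_bandwidth(100_000)
print(f"local triad bandwidth: {bw / 1e6:.1f} MB/s")
```

If every process runs the same loop on its own local piece and no data is exchanged, the aggregate bandwidth in this sketch would simply be the sum of the per-process values. An interpreted pure-Python loop will report far lower numbers than the array-based runs the paper measures.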

Paper Structure

This paper contains 7 sections, 3 equations, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: Distributed Array Mappings (adapted from Kepner2009). Different parallel mappings of a two-dimensional array. Arrays can be broken up in any dimension. A block mapping means that each $P_{ID}$ holds a contiguous piece of the array. Overlap allows the boundaries of an array to be stored on two neighboring $P_{ID}$s.
  • Figure 2: Parallel Stream Design. Each vector is a distributed array. If each vector has the same parallel map, then the resulting program will require no communication.
  • Figure 3: Measured Bandwidth. Matlab, Octave, and Python Stream triad bandwidth using distributed arrays for the different hardware configurations (see Table \ref{tab:HardwareTable}) run with the parameters listed in Table \ref{tab:STREAMparameters}. The bg-p data was adapted from byun2010toward. The xeon-p4 data was adapted from haney2004pmatlab. All plots show excellent vertical scaling within a node, horizontal scaling across nodes, and temporal scaling over multiple eras of hardware.
  • Figure 4: Temporal Scaling. Stream triad bandwidth of hardware at different eras for a single process on a single core running a single thread (bottom black line), multiple processes on multiple cores each running a single thread (middle blue line), and 2 processes on a 2 GPU node (top green line). These benchmark data indicate a 10x increase in single-core bandwidth over 20 years, a 100x increase in single node bandwidth over 20 years, and a 5x increase in single GPU node performance over 5 years.
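The block mapping in Figure 1 and the no-communication property in Figure 2 can be sketched with a small index calculation. This is a hedged illustration only: `block_range` and `owned_ranges` are hypothetical helper names, and the overlap convention used here (each $P_{ID}$ also stores the first `overlap` elements of its right-hand neighbor) is an assumption, not necessarily the convention used by the paper's libraries.

```python
def block_range(n, num_pids, pid, overlap=0):
    """Half-open index range [start, stop) held by `pid` under a 1-D block map.

    Elements are split as evenly as possible; the first `n % num_pids`
    processes get one extra element. With overlap > 0 (assumed convention),
    the range is extended into the next neighbor's block.
    """
    base, extra = divmod(n, num_pids)
    start = pid * base + min(pid, extra)
    stop = start + base + (1 if pid < extra else 0)
    return start, min(n, stop + overlap)


def owned_ranges(n, num_pids, overlap=0):
    """Ranges for every P_ID, in order."""
    return [block_range(n, num_pids, pid, overlap) for pid in range(num_pids)]


# Three vectors of the same length with the same map get identical ranges,
# so a[i] = b[i] + q * c[i] only ever touches indices local to one P_ID.
map_a = owned_ranges(10, 4)
map_b = owned_ranges(10, 4)
map_c = owned_ranges(10, 4)
print(map_a)                      # block map without overlap
print(owned_ranges(10, 4, 1))     # boundaries stored on two neighbors
```

Because the three maps are identical, each process can compute its part of the triad from its own pieces of the vectors, which is the reason the parallel Stream design in Figure 2 requires no communication.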