Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Fabian Knorr; Philip Salzmann; Peter Thoman; Thomas Fahringer

TL;DR

This work addresses the scheduling overhead that implicit memory management and coherence introduce in distributed SIMT programs on multi-GPU systems. It introduces the instruction graph (IDAG), a low-level intermediate representation that captures memory allocations, data transfers, MPI peer-to-peer communication and kernel invocations to maximize concurrency and hide communication latency. Integrated into the Celerity runtime, instruction-graph scheduling runs concurrently with execution, and a scheduler lookahead elides expensive memory-resize operations for virtualized buffers. Strong-scaling experiments with multiple applications on up to 128 GPUs in a production cluster demonstrate the effectiveness of the approach.

Abstract

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how graph-based intermediate representations help move such scheduling work out of the critical path. In the context of SYCL programs distributed onto accelerator clusters, we introduce the instruction graph, a low-level representation that preserves full concurrency between memory management, data transfers, MPI peer-to-peer communication and kernel invocation. Through integration within the Celerity runtime, we demonstrate how instruction-graph scheduling enables a system architecture that performs this analysis concurrently with execution. Using a scheduler lookahead mechanism, we further detect changing access patterns to optimize memory allocation in the presence of virtualized buffers. We show the effectiveness of our method through strong-scaling benchmarks with multiple Celerity applications on up to 128 GPUs in a production cluster.
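The lookahead idea can be illustrated with a toy model. The sketch below is not the Celerity implementation; it assumes one-dimensional buffers, a hypothetical `Range` type, and two hypothetical helper functions that only count how many backing allocations a sequence of accesses would trigger. Scheduling each command in isolation resizes the allocation whenever an access outgrows it, whereas inspecting a window of upcoming commands lets the scheduler allocate the merged extent once.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical 1D half-open element range [begin, end) accessed by one command.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// True if the backing allocation already covers the access.
inline bool covers(const Range& backing, const Range& access) {
    return access.begin >= backing.begin && access.end <= backing.end;
}

// Counts allocations when the scheduler may inspect up to `lookahead`
// commands beyond the current one before sizing an allocation.
// With lookahead == 0 every command is scheduled in isolation, so each
// access that outgrows the backing allocation causes a resize.
std::size_t count_allocations(const std::vector<Range>& accesses,
                              std::size_t lookahead) {
    std::size_t allocations = 0;
    bool allocated = false;
    Range backing{0, 0};

    for (std::size_t i = 0; i < accesses.size(); ++i) {
        if (allocated && covers(backing, accesses[i])) {
            continue;  // existing allocation suffices, no resize needed
        }
        // Merge the current extent with this access and the accesses
        // visible inside the lookahead window.
        Range merged = allocated ? backing : accesses[i];
        const std::size_t last = std::min(accesses.size(), i + 1 + lookahead);
        for (std::size_t j = i; j < last; ++j) {
            merged.begin = std::min(merged.begin, accesses[j].begin);
            merged.end = std::max(merged.end, accesses[j].end);
        }
        backing = merged;
        allocated = true;
        ++allocations;  // one (re)allocation emitted
    }
    return allocations;
}

// Example access pattern in which the accessed region doubles per command.
inline std::vector<Range> growing_pattern() {
    return {{0, 4}, {0, 8}, {0, 16}, {0, 32}};
}
```

In this model, calling `count_allocations(growing_pattern(), 0)` counts one allocation per command, while a window that spans all four commands needs a single allocation. A real scheduler would additionally have to bound the window so that lookahead does not delay execution, which this sketch does not model.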
