Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer

TL;DR

This work addresses the bottlenecks of implicit memory management and coherence in distributed SIMT programs on multi-GPU systems. It introduces the instruction graph (IDAG), a low-level intermediate representation that captures memory allocations, data transfers, MPI peer-to-peer communication, and kernel invocations to maximize concurrency and hide communication latency. Implemented in the Celerity runtime, instruction-graph scheduling decouples scheduling from execution and adds a scheduler lookahead that elides expensive memory-resize operations for virtualized buffers. Strong-scaling benchmarks with multiple Celerity applications on up to $128$ GPUs in a production cluster demonstrate the effectiveness of the approach.

Abstract

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how graph-based intermediate representations help move such scheduling work out of the critical path. In the context of SYCL programs distributed onto accelerator clusters, we introduce the instruction graph, a low-level representation that preserves full concurrency between memory management, data transfers, MPI peer-to-peer communication and kernel invocation. Through integration within the Celerity runtime, we demonstrate how instruction-graph scheduling enables a system architecture that performs this analysis concurrently with execution. Using a scheduler lookahead mechanism, we further detect changing access patterns to optimize memory allocation in the presence of virtualized buffers. We show the effectiveness of our method through strong-scaling benchmarks with multiple Celerity applications on up to 128 GPUs in a production cluster.

Paper Structure

This paper contains 26 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: The proposed instruction graph complements the established task- and command graph intermediate representations for distributed GPU programs. By modelling individual SYCL and MPI operations, it removes dataflow analysis overhead from the critical execution path.
  • Figure 2: Task graph (left) and command graph (right) computed by node $N0$ out of 2. Dataflow dependencies are colored black, anti- and output dependencies green, and graph-synchronization dependencies violet or orange.
  • Figure 3: Because accessors map to a single device-memory pointer, each must be backed by a contiguous allocation. Kernel launches will be preceded by the necessary allocation- and resize-copy instructions in the instruction graph.
  • Figure 4: Instruction graph compiled from the command graph in Figure 2 for the two local devices D0 and D1 on node N0. Each device receives half the command kernel index space (a quarter of the task kernel index space). Note that all send- and receive instructions in this picture are concurrent, even though they appear far apart due to layout constraints.
  • Figure 5: Proposed concurrent architecture of Celerity with instruction-graph scheduling. The user-controlled main thread, graph scheduler, executor state machines and backends are all decoupled to operate concurrently and communicate over single-producer-single-consumer (spsc) queues. Dashed lines represent thread boundaries.
  • ...and 3 more figures