Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

Issa Saba; Eishi Arima; Dai Liu; Martin Schulz

TL;DR

The paper addresses throughput optimization on CPU-GPU heterogeneous systems under a total power cap $P_{total}$ by jointly optimizing co-scheduling, resource partitioning, and power capping. It introduces a slowdown predictor driven by hardware counters and knob states to estimate co-run performance and uses a graph-based Edmonds' matching algorithm to select optimal job sets and hardware configurations that minimize $CoRunTime(JS,HC)$ while honoring $CoRunTime(JS,HC) \le SoloRunTime(JS,P_{total})$ and $P_i^c+P_i^g \le P_{total}$. Offline, benchmarks populate the predictor; online, the scheduler exhaustively searches hardware configurations per job set and applies Edmonds' algorithm to form a minimum-weight perfect matching of co-scheduled pairs. Experiments on a real platform with an AMD Ryzen Threadripper and NVIDIA A100 MIG demonstrate up to 67% speedup over time-sharing with naive power distribution, with accurate time predictions and responsive adaptation to different power budgets.
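The pairing step described above can be illustrated with a small sketch. It is not the authors' implementation: instead of Edmonds' algorithm it uses a brute-force recursion that returns the same minimum-weight perfect matching on tiny inputs, and the job names and predicted co-run times are hypothetical placeholders for what the slowdown predictor would supply.

```python
def min_weight_perfect_matching(jobs, cost):
    """Return (total_cost, pairs) minimizing the summed cost of all pairs.

    jobs: sequence with an even number of job names.
    cost: dict mapping frozenset({a, b}) -> predicted co-run time of that pair
          (assumed to already use the best hardware configuration for the pair).
    Brute force stands in for Edmonds' algorithm; only suitable for small inputs.
    """
    if len(jobs) % 2 != 0:
        raise ValueError("a perfect matching needs an even number of jobs")

    def solve(remaining):
        if not remaining:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        best_total, best_pairs = float("inf"), None
        # Try every possible partner for the first unmatched job.
        for i, partner in enumerate(rest):
            others = rest[:i] + rest[i + 1:]
            sub_total, sub_pairs = solve(others)
            total = cost[frozenset((first, partner))] + sub_total
            if total < best_total:
                best_total = total
                best_pairs = [(first, partner)] + sub_pairs
        return best_total, best_pairs

    return solve(tuple(jobs))


# Hypothetical predicted co-run times in seconds for four queued jobs.
jobs = ["A", "B", "C", "D"]
cost = {
    frozenset(("A", "B")): 10.0,
    frozenset(("A", "C")): 7.0,
    frozenset(("A", "D")): 9.0,
    frozenset(("B", "C")): 8.0,
    frozenset(("B", "D")): 5.0,
    frozenset(("C", "D")): 11.0,
}
total, pairs = min_weight_perfect_matching(jobs, cost)
print(total, pairs)
```

With these made-up weights the three possible pairings cost 21, 12, and 17 seconds, so the sketch selects A with C and B with D. A production scheduler would replace the recursion with a polynomial-time matching routine, since the brute-force search grows factorially with the number of jobs.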

Abstract

CPU-GPU heterogeneous architectures are now commonly used in a wide variety of computing systems, from mobile devices to supercomputers. Maximizing the throughput of multi-programmed workloads on such systems is indispensable, as a single program typically cannot fully exploit all available resources. At the same time, power consumption is a key issue and often requires optimizing power allocations to the CPU and GPU while enforcing a total power constraint, in particular when the power/thermal requirements are strict. The result is a system-wide optimization problem with several knobs. In particular, we focus on (1) co-scheduling decisions, i.e., selecting programs to co-locate in a space-sharing manner; (2) resource partitioning on both CPUs and GPUs; and (3) power capping on both CPUs and GPUs. We solve this problem with machine-learning-based predictive performance modeling that optimizes these knob settings in a coordinated manner. Our experimental results on a real system show that our approach achieves a speedup of up to 67% compared to time-sharing-based scheduling with naive power capping that evenly distributes the power budget across components.
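To make knobs (2) and (3) concrete, the following sketch exhaustively searches partitioning and power-cap settings for one co-scheduled job set under a total power budget. It rests on several assumptions: the function names, the candidate knob values, and the toy predictor formula are hypothetical and merely stand in for the learned performance model, and the partition is reduced to a single fraction for brevity.

```python
from itertools import product


def best_hardware_config(predict_corun_time, solo_time, p_total,
                         partitions, cpu_caps, gpu_caps):
    """Exhaustively search (partition, CPU cap, GPU cap) for one job set.

    Returns (time, partition, cpu_cap, gpu_cap) for the fastest feasible
    setting, or None if no setting satisfies both constraints.
    """
    best = None
    for part, p_cpu, p_gpu in product(partitions, cpu_caps, gpu_caps):
        if p_cpu + p_gpu > p_total:
            continue  # violates the total power constraint
        t = predict_corun_time(part, p_cpu, p_gpu)
        if t > solo_time:
            continue  # co-running must not be slower than time sharing
        if best is None or t < best[0]:
            best = (t, part, p_cpu, p_gpu)
    return best


def toy_predictor(part, p_cpu, p_gpu):
    """Hypothetical stand-in for the ML model: time falls as caps rise."""
    return 1000.0 / p_cpu + 3000.0 / p_gpu + 10.0 * abs(part - 0.5)


# Hypothetical knob values: partition fraction given to the first job,
# and candidate power caps in watts for the CPU and the GPU.
best = best_hardware_config(
    toy_predictor,
    solo_time=40.0,
    p_total=300,
    partitions=[0.25, 0.5, 0.75],
    cpu_caps=[100, 150, 200],
    gpu_caps=[100, 150, 200],
)
even_split_time = toy_predictor(0.5, 150, 150)  # naive even power distribution
print(best, even_split_time)
```

Under this toy model the search shifts power toward the GPU (100 W CPU, 200 W GPU) and beats the even 150 W / 150 W split, which mirrors the kind of imbalance the coordinated optimization is meant to exploit. The numbers themselves carry no meaning beyond illustration.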

Paper Structure (20 sections, 2 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Motivation, Problem, and Solution Overview
Motivation: Technology Trends
Problem Definition
Solution Overview
Modeling and Optimization
Slowdown Estimation for a Given Job Set and Hardware Setup
Metric Formulations
Performance Modeling
Hardware Setup Optimization for a Given Job Set
Job Sets Selection
Evaluation
Evaluation Setup
Environment
...and 5 more sections

Figures (7)

Figure 1: Problem Overview
Figure 2: Workflow of Our Solution
Figure 3: General Structure of Our Performance Modeling ($C=2$)
Figure 4: Overview of Graph-based Job Sets Creation ($W=6$, $C=2$)
Figure 5:
...and 2 more figures
