Exploring Uncore Frequency Scaling for Heterogeneous Computing

Zhong Zheng, Seyfal Sultanov, Michael E. Papka, Zhiling Lan

TL;DR

The paper addresses the energy inefficiency of default uncore frequency scaling in heterogeneous CPU-GPU HPC systems: GPU-dominated workloads rarely push CPU power to the TDP, so the uncore frequency rarely scales down and uncore power is wasted. It introduces MAGUS, a model-free, lightweight runtime that uses memory throughput and memory dynamics to predict and stabilize uncore frequency, with a two-phase design covering throughput prediction and rapid phase-change detection. Compared to the default settings, MAGUS achieves up to 27% energy savings and a 26% reduction in energy-delay product (EDP), and up to 34% EDP savings versus UPS, while keeping performance loss under 5% and runtime overhead under 1% across diverse single- and multi-GPU workloads. The work provides a concrete, scalable mechanism for energy optimization in heterogeneous HPC, with implications for future exascale systems and GPU-centric AI workloads.

Abstract

High-performance computing (HPC) systems are essential for scientific discovery and engineering innovation. However, their growing power demands pose significant challenges, particularly as systems scale to the exascale level. Prior uncore frequency tuning studies have primarily focused on conventional HPC workloads running on homogeneous systems. As HPC advances toward heterogeneous computing, integrating diverse GPU workloads on heterogeneous CPU-GPU systems, it is crucial to revisit and enhance uncore scaling. Our investigation reveals that uncore frequency scales down only when CPU power approaches its TDP (Thermal Design Power), an uncommon scenario in GPU-dominant applications, resulting in unnecessary power waste in modern heterogeneous computing systems. To address this, we present MAGUS, a user-transparent uncore frequency scaling runtime for heterogeneous computing. Effective uncore tuning is inherently complex, requiring dynamic detection of application execution phases that affect uncore utilization. Moreover, any robust strategy must work across a diverse range of applications, each with unique behaviors and resource requirements. Finally, an efficient runtime should introduce minimal overhead. We incorporate several key techniques in the design of MAGUS, including monitoring and predicting memory throughput, managing frequent phase transitions, and leveraging vendor-supplied power management support. We evaluate MAGUS using a diverse set of GPU benchmarks and applications across multiple heterogeneous systems with different CPU and GPU architectures. The experimental results show that MAGUS achieves up to 27% energy savings and 26% energy-delay product (EDP) reduction compared to the default settings while maintaining a performance loss below 5% and an overhead under 1%.
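
To make the reported metrics concrete, the following minimal sketch shows how energy savings, performance loss, and EDP reduction relate to one another, taking EDP as energy multiplied by execution time. The function names and all measurement values are hypothetical illustrations and are not taken from the paper.

```python
def edp(energy_joules: float, runtime_seconds: float) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return energy_joules * runtime_seconds


def relative_reduction(baseline: float, tuned: float) -> float:
    """Fractional reduction of `tuned` relative to `baseline` (positive = improvement)."""
    return (baseline - tuned) / baseline


# Hypothetical measurements for one workload (NOT results from the paper).
baseline_energy, baseline_time = 1000.0, 100.0  # joules, seconds (default uncore settings)
tuned_energy, tuned_time = 800.0, 104.0         # joules, seconds (uncore frequency scaled)

energy_saving = relative_reduction(baseline_energy, tuned_energy)
performance_loss = (tuned_time - baseline_time) / baseline_time
edp_reduction = relative_reduction(
    edp(baseline_energy, baseline_time),
    edp(tuned_energy, tuned_time),
)

print(f"energy saving:    {energy_saving:.1%}")
print(f"performance loss: {performance_loss:.1%}")
print(f"EDP reduction:    {edp_reduction:.1%}")
```

Because EDP multiplies energy by runtime, any slowdown offsets part of the energy saving, which is why an EDP reduction is typically somewhat smaller than the corresponding energy reduction.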

Paper Structure

This paper contains 18 sections, 9 figures, 1 table, 2 algorithms.

Figures (9)

  • Figure 1: UNet characterization on a heterogeneous Intel Xeon CPU–A100 GPU node. Each socket contains 40 hardware cores; for readability, we plot the core frequency of only four of these cores.
  • Figure 2: Power profiles of UNet training under different uncore frequencies: max (2.2 GHz) versus min (0.8 GHz).
  • Figure 3: MAGUS Overview. MAGUS comprises three main components: (1) Memory Throughput Monitor, (2) Memory Throughput Predictor, and (3) High-Frequency Memory Throughput Changing Detector, each being highlighted in a different color.
  • Figure 4: Overall performance of the benchmarks and applications on Intel+A100. The X-axis lists the benchmarks and applications, while the Y-axis shows the corresponding metrics achieved by MAGUS and UPS against the baseline.
  • Figure 5: Overall performance on Intel+MAX1550. The X-axis lists the benchmarks, while the Y-axis shows the corresponding metrics achieved by MAGUS and UPS against the baseline.
  • ...and 4 more figures