
Modeling and Controlling Many-Core HPC Processors: an Alternative to PID and Moving Average Algorithms

Giovanni Bambini, Alessandro Ottaviano, Christian Conficoni, Andrea Tilli, Luca Benini, Andrea Bartolini

TL;DR

This work tackles dynamic thermal and power management for modern heterogeneous HPC MPSoCs by developing a coupled, multi-variable model that captures the interactions among power, temperature, voltage, and workload, including actuator non-idealities and exponential leakage. It proposes a fuzzy-inspired iterative control policy that replaces traditional PID-based schemes and uses a domain-wise root-finding step to satisfy the $F$–$V$ relationship within shared voltage domains. In model-in-the-loop and hardware-in-the-loop co-simulations, the approach reduces the maximum exceeded temperature by up to 5× and shortens application runtime by 3.56% on average across the evaluation scenarios, while maintaining target compliance and execution progress. The result is a practical, open, and scalable framework for joint power–thermal control in large, heterogeneous HPC processors; future directions include simplifying the power-distribution blocks and exploring predictive methods with accelerator support.

Abstract

The race towards higher performance and computing power has led to chips with heterogeneous and complex designs, integrating an ever-growing number of cores on the same monolithic chip or chiplet silicon die. Higher integration density, compounded with the slowdown of technology-driven power reduction, makes power and thermal management increasingly relevant. Unfortunately, existing research lacks a detailed analysis and modeling of thermal, power, and electrical coupling effects, and of how they must be jointly considered to perform dynamic control of complex and heterogeneous Multi-Processor Systems-on-Chip (MPSoCs). To close this gap, we first provide a detailed thermal and power model targeting a modern High Performance Computing (HPC) MPSoC. We consider real-world coupling effects such as actuators' non-idealities and the exponential relation between the dissipated power, the temperature state, and the voltage level in a single processing element. We analyze how these factors affect the behavior of the control algorithm and the challenges they pose. Based on this analysis, we propose a thermal capping strategy inspired by Fuzzy control theory to replace the state-of-the-art PID controller, as well as an iterative root-finding method to optimally choose the shared voltage value among cores grouped in the same voltage domain. We evaluate the proposed controller with model-in-the-loop and hardware-in-the-loop co-simulations. Compared with state-of-the-art methods, the proposed controller reduces the maximum exceeded temperature by up to 5x while providing an average of 3.56% faster application execution runtime across all the evaluation scenarios.
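
To make the shared-voltage idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical linear $F$–$V$ map, a hypothetical leakage law that grows exponentially with voltage and temperature, and a per-domain power budget; every function name and coefficient below is an illustrative assumption. Plain bisection stands in for the root-finding step: it searches for the highest shared voltage at which the summed power of the cores in one domain still fits the budget.

```python
import math

# All coefficients below are illustrative assumptions, not values from the paper.
V_MIN, V_MAX = 0.5, 1.2      # allowed shared-voltage range [V]
V_TH, K_F = 0.3, 4.0         # hypothetical linear F-V map: f = K_F * (V - V_TH) [GHz]


def max_frequency(v):
    """Hypothetical F-V relation: highest frequency sustainable at voltage v."""
    return K_F * (v - V_TH)


def leakage_power(v, temp, k0=0.5, kv=2.0, kt=0.02, v_ref=0.8, t_ref=50.0):
    """Assumed leakage law, exponential in both voltage and temperature."""
    return k0 * v * math.exp(kv * (v - v_ref) + kt * (temp - t_ref))


def domain_power(v, temps, c_effs):
    """Total power of all cores sharing voltage v (dynamic + leakage)."""
    f = max_frequency(v)
    return sum(c * f * v * v + leakage_power(v, t) for t, c in zip(temps, c_effs))


def solve_shared_voltage(budget, temps, c_effs, tol=1e-6, max_iter=100):
    """Bisection on g(V) = domain_power(V) - budget, which increases with V."""
    def g(v):
        return domain_power(v, temps, c_effs) - budget

    if g(V_MAX) <= 0:        # budget never binds: run at the highest voltage
        return V_MAX
    if g(V_MIN) >= 0:        # budget cannot be met: clamp to the lowest voltage
        return V_MIN
    lo, hi = V_MIN, V_MAX    # invariant: g(lo) <= 0 < g(hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return lo                # largest tested voltage that respects the budget


if __name__ == "__main__":
    temps = [60.0, 70.0, 80.0, 75.0]     # per-core temperatures [deg C]
    c_effs = [1.0, 1.2, 0.8, 1.0]        # per-core effective capacitance (workload proxy)
    v = solve_shared_voltage(12.0, temps, c_effs)
    print(f"shared V = {v:.4f} V, f = {max_frequency(v):.3f} GHz, "
          f"P = {domain_power(v, temps, c_effs):.3f} W")
```

Bisection is chosen here only because the assumed power curve is monotone in the voltage, so a bracketing method cannot diverge; a faster bracketing solver could be swapped in without changing the interface.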

Paper Structure (28 sections, 14 equations, 13 figures, 3 tables)

This paper contains 28 sections, 14 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Architecture of the cascade control blocks in an system. The three blocks have separate time domains, their own abstraction of the system to be controlled, and distinct external information depending on their scope. In particular, the (inner light green block) controls the low-level, physical parts of the system (VRM, LDO, DT, and PLL) based on the target operating points given by the local . The also fetches information from the PVT and other sensors and communicates it to the . The (middle dark green block) takes information from the about the system as well as information from the executing application and computes a set of operating points. Finally, the global (external light blue block) takes information from the system and from the surroundings and communicates a performance/energy-efficiency target to each local .
  • Figure 2: Representation of a -component with the four main elements of the control problem (the inputs Frequency ($F$) and Voltage ($V$), and the outputs Temperature ($T$) and consumed Power ($P$)) and the workload $\omega_i(t)$. The power consumption $P*$ is grayed out because it is not directly measurable per single . Instead, the power consumption measure is provided for each power rail (i.e. group of ).
  • Figure 3: Representation of the HPC Processor thermal architecture. The die contains all the which are the source of the heat production through power dissipation. The main heat dissipation path is indicated by the red arrow passing through the Heat Spreader, the Heat Sink, and two TIM layers. The secondary heat dissipation path goes through the layer below the die to the Motherboard.
  • Figure 4: Example of finite-element spatial discretization, with the corresponding lumped-parameter model of a single . In the model, the green part (below) concerns the core, with $P_k$ being the consumed power generated by the ; the yellow part (above) concerns the heat spreader, with $T_E$ being the main dissipation path to the Heat Sink.
  • Figure 5: Multi-view comparison between the proposed model of the leakage power, characterized by an exponential relationship with Voltage and Temperature (blue surface), the same model of leakage power with no exponential relation (magenta surface), and the minimum and maximum dynamic power (yellow and red surfaces, respectively).
  • ...and 8 more figures