Modeling and Controlling Many-Core HPC Processors: an Alternative to PID and Moving Average Algorithms
Giovanni Bambini, Alessandro Ottaviano, Christian Conficoni, Andrea Tilli, Luca Benini, Andrea Bartolini
TL;DR
This work addresses dynamic thermal and power management for modern heterogeneous HPC MPSoCs. It develops a coupled, multi-variable model of the interactions among power, temperature, voltage, and workload, including actuator non-idealities and exponential leakage. On top of this model, it proposes a fuzzy-inspired iterative control policy that replaces traditional PID-based schemes, together with a domain-wise root-finding step that satisfies the $F$–$V$ relationship when cores share a voltage domain. In model-in-the-loop and hardware-in-the-loop simulations, the approach reduces the maximum exceeded temperature by up to about 5× and shortens application runtime by roughly 3.6% on average across the evaluated scenarios, while maintaining robust target compliance and execution progression. The results provide a practical, open, and scalable framework for joint power–thermal control of large, heterogeneous HPC processors in the presence of strong couplings and non-ideal actuators, and point to future work on simplifying power-distribution blocks and exploring predictive methods with accelerator support.
Abstract
The race for higher performance and computing power has led to chips with heterogeneous and complex designs that integrate an ever-growing number of cores on the same monolithic chip or chiplet silicon die. Higher integration density, compounded by the slowdown of technology-driven power reduction, makes power and thermal management increasingly relevant. Unfortunately, existing research lacks a detailed analysis and modeling of thermal, power, and electrical coupling effects, and of how they must be considered jointly to perform dynamic control of complex and heterogeneous Multi-Processor Systems-on-Chip (MPSoCs). To close this gap, we first provide a detailed thermal and power model targeting a modern High Performance Computing (HPC) MPSoC. We consider real-world coupling effects such as actuator non-idealities and the exponential relation among the dissipated power, the temperature state, and the voltage level in a single processing element. We analyze how these factors affect the behavior of the control algorithm and the challenges they pose. Based on this analysis, we propose a thermal capping strategy inspired by fuzzy control theory to replace the state-of-the-art PID controller, as well as an iterative root-finding method to optimally choose the shared voltage value among cores grouped in the same voltage domain. We evaluate the proposed controller with model-in-the-loop and hardware-in-the-loop co-simulations. Compared with state-of-the-art methods, the maximum exceeded temperature is reduced by up to 5x, while application execution runtime is 3.56% faster on average across all evaluation scenarios.
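To make the shared-voltage idea more concrete, the following is a minimal Python sketch of a root-finding step for one voltage domain. It is not the authors' algorithm: the linear $F$–$V$ curve (`f_max`, `v_for_freq`), the per-core power model with leakage exponential in voltage and temperature (`core_power`), every coefficient, and the per-domain power `budget` are hypothetical placeholders chosen only for illustration. Plain bisection is used here as one possible root-finding method.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
SLOPE, V_TH = 4.0, 0.3  # hypothetical linear F-V curve: f_max = SLOPE * (V - V_TH)

def f_max(v):
    """Highest frequency (GHz) assumed sustainable at voltage v (V)."""
    return max(0.0, SLOPE * (v - V_TH))

def v_for_freq(f):
    """Inverse of the assumed F-V curve: voltage needed to run at frequency f."""
    return f / SLOPE + V_TH

def core_power(v, f, temp, c_eff=1.0, k_leak=0.1, a_v=1.5, a_t=0.02, t_ref=45.0):
    """Assumed per-core power (W): dynamic C*f*V^2 plus leakage exponential in V and T."""
    dynamic = c_eff * f * v * v
    leakage = k_leak * v * math.exp(a_v * v + a_t * (temp - t_ref))
    return dynamic + leakage

def domain_power(v, f_targets, temps):
    """Total power of the domain when every core shares voltage v.
    Each core runs at its target frequency, clipped to what v allows."""
    return sum(core_power(v, min(f, f_max(v)), t) for f, t in zip(f_targets, temps))

def shared_voltage(f_targets, temps, budget, v_min=0.5, v_max=1.2, tol=1e-6):
    """Pick one voltage for the whole domain.
    Start from the voltage required by the most demanding core; if the domain
    then exceeds the power budget, bisect for the largest voltage that fits."""
    v_req = min(v_max, max(v_min, v_for_freq(max(f_targets))))
    if domain_power(v_req, f_targets, temps) <= budget:
        return v_req  # the budget is not binding
    if domain_power(v_min, f_targets, temps) > budget:
        return v_min  # the budget cannot be met even at the lowest voltage
    lo, hi = v_min, v_req  # invariant: power(lo) <= budget < power(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if domain_power(mid, f_targets, temps) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    targets = [3.0, 2.0, 1.0, 3.4]    # requested core frequencies (GHz)
    temps = [60.0, 70.0, 55.0, 80.0]  # current core temperatures (deg C)
    v = shared_voltage(targets, temps, budget=12.0)
    print(f"shared voltage: {v:.4f} V")
    print("applied frequencies:", [round(min(f, f_max(v)), 3) for f in targets])
```

Bisection is sufficient in this sketch because, under the assumed model, domain power increases monotonically with the shared voltage, so the budget is crossed exactly once on the search interval. A real controller would replace the placeholder model with the identified power and thermal model of the target chip.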
