On the Convergence of Malleability and the HPC PowerStack: Exploiting Dynamism in Over-Provisioned and Power-Constrained HPC Systems

Eishi Arima; Isaías A. Comprés; Martin Schulz

TL;DR

This paper addresses the challenge of rising power consumption in HPC by proposing a convergence of malleability, co-scheduling, and power management within a PowerStack-inspired software stack. It motivates this direction through technology trends in hardware, programming models, and energy-aware computing, and formulates a problem statement centered on supporting malleable jobs under power constraints via a hierarchical, dynamic resource-management architecture. A strawman design defines four interacting components—System Manager, Job Manager, Node Manager, and Monitor—and outlines explicit requirements and role-specific functions, complemented by site-administrator considerations. The authors describe ongoing integration efforts (e.g., DEEP-SEA, REGALE) and toolchains (Slurm, PMIx, EAR, Countdown, BDPO, BEO, PULP Controller, DCDB, EXAMON) aimed at enabling practical convergence of malleability and PowerStack, with an emphasis on energy-aware policies and co-scheduling. The work envisions significant practical impact by enabling more efficient utilization and energy management in future over-provisioned HPC systems, and it outlines a roadmap toward validation via software integration and potentially trace-driven studies.

Abstract

Recent High-Performance Computing (HPC) systems face important challenges, such as massive power consumption combined with significantly under-utilized system resources. Given power consumption trends, future systems will be deployed in an over-provisioned manner, with more resources installed than can be powered simultaneously. In such a scenario, maximizing resource utilization and energy efficiency while staying within a given power constraint is pivotal. Driven by this observation, in this position paper we first highlight recent trends in resource management techniques, with a particular focus on malleability support (i.e., dynamically scaling resource allocations/requirements for a job), co-scheduling (i.e., co-locating multiple jobs within a node), and power management. Second, we consider combining these techniques, assess their relationships and synergies, and discuss the functionality required of each software component in future over-provisioned and power-constrained HPC systems. Third, we briefly introduce our ongoing efforts to integrate software tools, which will ultimately lead to the convergence of malleability and power management as designed in the HPC PowerStack initiative.
