Digital Twin-Based Cooling System Optimization for Data Center

Shrenik Amol Jadhav, Zheng Liu

Abstract

Data center cooling systems consume significant auxiliary energy, yet optimization studies rarely quantify the gap between theoretically optimal and operationally deployable control strategies. This paper develops a digital twin of the liquid cooling infrastructure at the Frontier exascale supercomputer, in which a hot-temperature water system comprises three parallel subloops, each serving dedicated coolant distribution unit clusters through plate heat exchangers and variable-speed pumps. The surrogate model is built in Modelica and validated against one full calendar year of 10-minute operational data following ASHRAE Guideline 14. At the subloop level, the model achieves a coefficient of variation of the root mean square error below 2.7% and a normalized mean bias error within 2.5%. Using this validated surrogate model, a layered optimization framework evaluates three progressively constrained strategies: analytical flow-only optimization achieves a total energy saving of 20.4%; unconstrained joint optimization of flow rate and supply temperature achieves 30.1%; and ramp-constrained joint optimization of flow rate and supply temperature, which enforces actuator rate limits, achieves 27.8%. The analysis reveals that the baseline system operates at 2.9 times the minimum thermally safe flow rate, and that co-optimizing supply temperature with flow rate nearly doubles the savings achievable by flow reduction alone.
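The two validation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used ASHRAE Guideline 14 forms, with residuals taken as measured minus simulated and a degrees-of-freedom adjustment `p` (set to 1 here as an assumption). The function name `cvrmse_nmbe` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def cvrmse_nmbe(measured, simulated, p=1):
    """Return (CV(RMSE), NMBE) in percent for paired samples.

    Assumed convention: residual = measured - simulated, and both
    metrics are normalized by the mean of the measured series.
    `p` is the degrees-of-freedom adjustment (assumed to be 1).
    """
    n = len(measured)
    if n != len(simulated):
        raise ValueError("series must have the same length")
    if n <= p:
        raise ValueError("need more samples than adjusted parameters")

    mean_measured = sum(measured) / n
    if mean_measured == 0:
        raise ValueError("mean of measured data must be nonzero")

    residuals = [m - s for m, s in zip(measured, simulated)]

    # CV(RMSE): root mean square error relative to the measured mean.
    rmse = math.sqrt(sum(r * r for r in residuals) / (n - p))
    cvrmse = 100.0 * rmse / mean_measured

    # NMBE: signed bias relative to the measured mean.
    nmbe = 100.0 * sum(residuals) / ((n - p) * mean_measured)
    return cvrmse, nmbe


if __name__ == "__main__":
    # Hypothetical 10-minute samples, for illustration only.
    measured = [10.0, 10.0, 10.0, 10.0, 10.0]
    simulated = [9.0, 11.0, 10.0, 10.0, 10.0]
    cv, bias = cvrmse_nmbe(measured, simulated)
    print(f"CV(RMSE) = {cv:.2f}%, NMBE = {bias:.2f}%")
```

Because NMBE is signed, over- and under-predictions can cancel, which is why it is usually reported alongside CV(RMSE) rather than on its own.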

Paper Structure (43 sections, 18 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Schematic of the Frontier supercomputer cooling architecture. The HTW system comprises three parallel subloops, each serving dedicated CDU clusters through plate heat exchangers. A variable-speed pump drives the supply flow from the cooling tower basin through the subloops, and warm return flow is rejected to the atmosphere via a mechanical-draft cooling tower. State variables used in the digital twin model are annotated at measurement points.
  • Figure 2: Digital twin framework for cooling system optimization. The physical layer provides operational data, the Modelica-based digital twin predicts thermal and energy performance, and the optimization layer determines energy-minimal control setpoints subject to thermal safety constraints.
  • Figure 3: Annual operating conditions for the Frontier HTW cooling loop in 2023: (a) total computational heat load, (b) supply and return temperatures with 42 $^\circ$C constraint threshold, and (c) total pump flow rate. Gray traces show 10-minute data; colored lines show 7-day rolling means.
  • Figure 4: Baseline over-pumping diagnosis: (a) measured baseline flow versus analytically computed minimum safe flow, and (b) distribution of the over-pumping ratio. The median ratio of $1.5\times$ indicates systematic over-pumping across all operating conditions.
  • Figure 5: Annual cooling energy breakdown by component (pump vs. CT fan) for each optimization strategy. Stacked bars show the relative contributions; total values are annotated in thousands of kWh.
  • ...and 8 more figures
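
The over-pumping diagnosis in the Figure 4 caption compares measured flow with an analytically computed minimum safe flow. One plausible way to compute such a minimum is a steady-state energy balance, sketched below. The sketch assumes that the 42 $^\circ$C threshold from Figure 3 applies to the return water temperature, that the coolant has the specific heat of water, and that heat load and supply temperature are known at each time step. The function names and numeric inputs are hypothetical, and the paper's actual formulation may differ.

```python
CP_WATER = 4186.0  # J/(kg K), assumed specific heat of the coolant


def min_safe_flow(heat_load_w, t_supply_c, t_limit_c=42.0, cp=CP_WATER):
    """Smallest mass flow (kg/s) that keeps the return temperature
    at or below t_limit_c, from Q = m_dot * cp * (T_return - T_supply).
    """
    delta_t = t_limit_c - t_supply_c
    if delta_t <= 0:
        raise ValueError("supply temperature must be below the limit")
    return heat_load_w / (cp * delta_t)


def over_pumping_ratio(baseline_flow_kg_s, heat_load_w, t_supply_c,
                       t_limit_c=42.0, cp=CP_WATER):
    """Ratio of measured baseline flow to the minimum safe flow."""
    return baseline_flow_kg_s / min_safe_flow(
        heat_load_w, t_supply_c, t_limit_c, cp
    )


if __name__ == "__main__":
    # Hypothetical operating point, for illustration only.
    heat_load_w = 4_186_000.0   # W
    t_supply_c = 32.0           # degrees C
    baseline_flow = 150.0       # kg/s
    print(min_safe_flow(heat_load_w, t_supply_c))
    print(over_pumping_ratio(baseline_flow, heat_load_w, t_supply_c))
```

A ratio above 1 means the loop moves more water than the thermal limit requires at that operating point, which is the condition the flow-only optimization exploits.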