Transforming Future Data Center Operations and Management via Physical AI

Zhiwei Cao, Minghao Li, Feng Lin, Jimin Jia, Yonggang Wen, Jianxiong Yin, Simon See

TL;DR

This paper introduces Physical AI (PhyAI) as an integrated framework to transform data center operations by coupling high-fidelity physical simulators, a physics-informed AI engine, and a 5-tier Omniverse-based digital twin. It presents a modular, scalable pipeline that generates synthetic data from in-house simulators, trains surrogate models with NVIDIA PhysicsNemo, and enables real-time predictive and prescriptive control via DRL and learning-based MPC. A detailed case study on rapid CFD/HT surrogate modeling for a large-scale production DC demonstrates real-time inference with a median temperature error of 0.18 °C and latency around 0.01 s, outperforming traditional CFD/HT approaches. The work outlines concrete future directions, including foundation models, physics-informed training, quantum acceleration, and differentiable simulation, to further enhance data center autonomy, efficiency, and reliability. Overall, PhyAI offers a practical, data-efficient path toward AI-native, autonomous DC operations with real-time digital twins and physics-consistent decision-making.

Abstract

Data centers (DCs), as mission-critical infrastructure, are pivotal in powering the growth of artificial intelligence (AI) and the digital economy. The evolution from Internet DC to AI DC has introduced new challenges in operating and managing data centers for improved business resilience and reduced total cost of ownership. As a result, new paradigms, beyond the traditional approaches based on best practices, are in order for future data centers. In this research, we propose and develop a novel Physical AI (PhyAI) framework for advancing DC operations and management. Our system leverages the emerging capabilities of state-of-the-art industrial products and our in-house research and development. Specifically, it presents three core modules: 1) an industry-grade in-house simulation engine to simulate DC operations with high accuracy, 2) an AI engine built upon NVIDIA PhysicsNemo for the training and evaluation of physics-informed machine learning (PIML) models, and 3) a digital twin platform built upon NVIDIA Omniverse for our proposed 5-tier digital twin framework. This system presents a scalable and adaptable solution to digitalize, optimize, and automate future data center operations and management by enabling real-time digital twins. To illustrate its effectiveness, we present a case study on building a surrogate model for predicting the thermal and airflow profiles of a large-scale DC in real time. Our results demonstrate its superior performance over traditional time-consuming Computational Fluid Dynamics/Heat Transfer (CFD/HT) simulation, with a median absolute temperature prediction error of 0.18 °C. This emerging approach opens doors to several potential research directions for advancing Physical AI in future DC operations.

Paper Structure

This paper contains 24 sections, 5 figures.

Figures (5)

  • Figure 1: Illustration of the typical DC physical infrastructure, which consists of the IT system, the cooling system, and the power supply system. The IT system hosts various workloads, ranging from traditional cloud services, networking, and storage workloads to emerging AI workloads. To support the high-density AI workloads, hybrid cooling systems (liquid cooling + air cooling) will be widely adopted in the future. To mitigate the carbon and energy footprint, mixed energy systems that integrate multiple energy sources including traditional grid electricity and green energy are prevailing.
  • Figure 2: Illustration of the PhyAI-driven system for autonomous DC operations. We build the 3D representation of a physical DC with NVIDIA Omniverse, a powerful platform for constructing and rendering 3D environments with the RTX rendering technology. We also enable real-time sensory data visualization and analysis on Omniverse. Built on Omniverse, we develop a bidirectional converter to seamlessly transfer data between Omniverse and our in-house physics-based simulators for system-level DC simulation. The synthetic dataset is fed to the PhyAI engine built on NVIDIA PhysicsNemo, a modern framework for large-scale physics-informed AI model training and inference. The trained PhyAI model is deployed in physical DCs for predictive and prescriptive analysis.
  • Figure 3: Illustration of the proposed physics-informed model architecture. We conduct domain decomposition and split the computation domain into three parts, i.e., the raised floor area, the cold aisle, and the hot aisle, because the flow and thermal profiles in the three areas differ significantly. In addition, we model the airflow and thermal fields separately, as the fluid field and thermal field are loosely coupled in the context of the room-level cooling process. We first simulate the flow field and then use the simulated flow field to infer the thermal field according to the governing equations. To ease the training, we first train the fluid network and then freeze it to train the thermal network.
  • Figure 4: Illustration of the layout of the considered DC with NVIDIA Omniverse. The data hall contains 4 rows of racks, 60 racks in total, with 317 servers installed in them. Six ACUs provide cold air, and hot aisle containment is used to improve cooling efficiency.
  • Figure 5: Illustration of the predicted and the true temperature and airflow profile, as well as the absolute temperature prediction error distribution. The airflow profile is visualized with streamlines in Omniverse. It can be seen that the surrogate model produces visually similar thermal and airflow profiles compared with the ground truth. The quantitative result shows that for most areas within the data hall, the absolute temperature prediction error is within 2.5 °C with a median absolute error of 0.18 °C.
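
The two-stage schedule described in the Figure 3 caption (train the fluid network, freeze it, then train the thermal network on its output) and the median absolute error reported in the Figure 5 caption can be sketched with a toy example. This is only an illustration under strong assumptions and is not the authors' model: scalar linear models stand in for the neural networks, the data and the "true" relations are made up, and names such as `fit_flow` and `fit_thermal` are hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy data (assumption, not from the paper):
# airflow = 2.0 * fan setting, temperature = 30.0 - 1.5 * airflow.
fans = [random.uniform(0.5, 1.5) for _ in range(200)]
flows = [2.0 * f for f in fans]
temps = [30.0 - 1.5 * q for q in flows]


def fit_flow(fans, flows, lr=0.1, steps=500):
    """Stage 1: fit the flow model q = a * fan by gradient descent on MSE."""
    a = 0.0
    n = len(fans)
    for _ in range(steps):
        grad = sum(2 * (a * f - q) * f for f, q in zip(fans, flows)) / n
        a -= lr * grad
    return a


def fit_thermal(a_frozen, fans, temps, lr=0.1, steps=2000):
    """Stage 2: fit T = b * q_pred + c while the flow weight stays fixed."""
    # The flow model is frozen: its output is computed once and never updated.
    q_pred = [a_frozen * f for f in fans]
    b, c = 0.0, 0.0
    n = len(fans)
    for _ in range(steps):
        errs = [b * q + c - t for q, t in zip(q_pred, temps)]
        grad_b = sum(2 * e * q for e, q in zip(errs, q_pred)) / n
        grad_c = sum(2 * e for e in errs) / n
        b -= lr * grad_b
        c -= lr * grad_c
    return b, c


a = fit_flow(fans, flows)           # stage 1: flow model only
b, c = fit_thermal(a, fans, temps)  # stage 2: thermal model on frozen flow

# Evaluate with the median absolute temperature error on held-out inputs.
test_fans = [0.6, 0.9, 1.2, 1.4]
abs_err = [abs((b * (a * f) + c) - (30.0 - 1.5 * 2.0 * f)) for f in test_fans]
median_abs_err = statistics.median(abs_err)
print(f"a={a:.4f} b={b:.4f} c={c:.4f} median |error|={median_abs_err:.2e}")
```

Freezing the first stage keeps the thermal fit from distorting an already trained flow model, which is one plausible reading of why the caption says this schedule eases training. The median is a robust summary of the error distribution: a few large local errors (up to 2.5 °C in Figure 5) barely move it.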