Improving AI Efficiency in Data Centres by Power Dynamic Response

Andrea Marinoni, Sai Shivareddy, Pietro Lio', Weisi Lin, Erik Cambria, Clare Grey

TL;DR

This paper addresses the power management challenge in AI data centres driven by the growth of large AI models. It proposes dynamic power response by making part of the input power highly dynamic to absorb spikes, contrasting passive and active energy-storage strategies. It provides spike-characterization results from real platforms and quantifies potential computational gains, CAPEX savings, and management costs under different device choices. The findings indicate that dynamic power response, especially with active storage, can reduce downtime and environmental impact while enabling hardware-aware AI workloads. The work suggests integrating power management with intelligent control and scheduling for greener, more resilient AI infrastructure.

Abstract

The steady growth of artificial intelligence (AI) has accelerated in recent years, facilitated by the development of sophisticated models such as large language models and foundation models. Robust and reliable power infrastructure is fundamental to realising the full potential of AI. However, AI data centres are extremely power-hungry, which puts their power management in the spotlight, especially with respect to its impact on the environment and sustainable development. In this work, we investigate the capacity and limits of solutions based on an innovative approach to the power management of AI data centres, i.e., making part of the input power as dynamic as the power used for data-computing functions. The performance of passive and active devices is quantified and compared in terms of computational gain, energy efficiency, reduction of capital expenditure, and management costs by analysing power trends from multiple data platforms worldwide. This strategy, which represents a paradigm shift in AI data centre power management, has the potential to strongly improve the sustainability of AI hyperscalers, improving their environmental, financial, and societal footprint.

Paper Structure

This paper contains 4 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Typical trends of AI accelerator power draw (light blue line) over time. (A): State-of-the-art approach: dummy loads (black shaded area) are used during idle intervals to reduce the amplitude of the power fluctuations. The use of dummy loads degrades the computational load (with respect to the required power profile, shown as a black line), because dummy loads deteriorate the thermal profiles of AI accelerators. (B-D): Power trends when solutions for dynamic power response are employed. (B,C): Power trend profiles when passive devices are used. (D): Power trend profiles when active devices are used.
  • Figure 2: Characteristics of the power spikes identified in the real-life datasets of AI power trends: histogram of peak energy (expressed in joules), and summary of the percentile thresholds of the peak durations (top right).
  • Figure 3: Number of GPUs that would not face shutdown if power spikes lasting longer than 'Burst length' and exceeding 'Threshold' percent of the rack maximum power could be absorbed. Each GPU is modelled on the Nvidia H100, i.e., with an instantaneous power draw of 700 W during training.
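
The captions of Figures 2 and 3 describe spikes by duration, by energy in joules, and by a threshold given as a percentage of the rack maximum power. The following is a minimal sketch of how such a characterisation could be computed from a sampled power trace. It is not the authors' method: the function name `find_spikes`, the `Spike` record, the uniform sampling interval, and the choice to count only the energy above the threshold are all assumptions made for illustration. Only the 700 W per-GPU figure comes from the Figure 3 caption.

```python
from dataclasses import dataclass
from typing import List, Sequence

H100_POWER_W = 700.0  # per-GPU draw during training, from the Figure 3 caption


@dataclass
class Spike:
    start_s: float     # time at which the trace first exceeds the limit
    duration_s: float  # how long the trace stays above the limit
    energy_j: float    # energy above the limit (assumed definition)


def find_spikes(
    power_w: Sequence[float],
    dt_s: float,
    rack_max_w: float,
    threshold_pct: float,
    min_burst_s: float = 0.0,
) -> List[Spike]:
    """Return runs of samples above threshold_pct of rack_max_w.

    Assumes uniform sampling every dt_s seconds. Only runs strictly
    longer than min_burst_s are kept, mirroring 'Burst length'.
    """
    limit_w = rack_max_w * threshold_pct / 100.0
    spikes: List[Spike] = []
    run_start = None
    excess_j = 0.0
    # A trailing -inf sample closes any run that reaches the end of the trace.
    for i, p in enumerate(list(power_w) + [float("-inf")]):
        if p > limit_w:
            if run_start is None:
                run_start = i
                excess_j = 0.0
            excess_j += (p - limit_w) * dt_s  # rectangle-rule integration
        elif run_start is not None:
            duration_s = (i - run_start) * dt_s
            if duration_s > min_burst_s:
                spikes.append(Spike(run_start * dt_s, duration_s, excess_j))
            run_start = None
    return spikes


if __name__ == "__main__":
    # Hypothetical rack of 8 GPUs and a made-up trace sampled every 0.5 s.
    rack_max_w = 8 * H100_POWER_W
    trace_w = [4000, 5200, 5600, 5300, 4000, 5100, 4000]
    for spike in find_spikes(trace_w, 0.5, rack_max_w, threshold_pct=90.0):
        print(spike)
```

Raising `min_burst_s` or `threshold_pct` shrinks the set of spikes that an energy-storage device would need to absorb, which is the trade-off that the 'Burst length' and 'Threshold' axes of Figure 3 explore.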