Coca4ai: checking energy behaviors on AI data centers

Paul Gay, Éric Bilinski, Anne-Laure Ligozat

TL;DR

The paper tackles the growing environmental footprint of AI data centers by presenting a lightweight, per-job energy profiling system deployed on the labia data center. It combines software wattmeters (nvidia-smi, RAPL) with external wattmeters for validation, and uses SLURM job accounting to attribute power to individual PIDs in proportion to their relative CPU time. Key findings are a 16% gap between software and external measurements, attributed to unmonitored components; that non-completed jobs dominate energy use; and that GPUs are underutilized across the observed workloads. The work demonstrates a practical, open-source approach to monitoring energy behaviors at data-center scale, offering actionable insights for efficiency improvements and for engaging users early in energy-aware practices.
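
As a rough illustration of attribution by relative CPU time, the sketch below splits one node-level power reading across PIDs in proportion to the CPU time each consumed during the sampling interval. The function name, its inputs, and the sample numbers are hypothetical and are not taken from the paper's code; the authors' actual implementation may differ.

```python
from typing import Dict


def attribute_power(node_power_w: float, cpu_time_s: Dict[int, float]) -> Dict[int, float]:
    """Split a node-level power reading (watts) across PIDs.

    cpu_time_s maps each PID to the CPU time (seconds) it consumed during
    the sampling interval. Each PID receives a share of the power equal to
    its fraction of the total CPU time. This is an illustrative assumption,
    not the paper's exact formula.
    """
    total = sum(cpu_time_s.values())
    if total <= 0:
        # No CPU activity recorded: nothing to attribute.
        return {pid: 0.0 for pid in cpu_time_s}
    return {pid: node_power_w * t / total for pid, t in cpu_time_s.items()}


# Hypothetical sample: a 120 W reading shared by two job processes.
shares = attribute_power(120.0, {1001: 3.0, 1002: 1.0})
print(shares)  # {1001: 90.0, 1002: 30.0}
```

Summing the per-PID shares of all processes belonging to one SLURM job would then give a per-job estimate under the same assumption.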

Abstract

Monitoring energy behaviors in AI data centers is crucial, both to reduce their energy consumption and to raise awareness among their users, who are key actors in the AI field. This paper presents a proof of concept for easy, lightweight monitoring of energy behaviors at the scale of a whole data center, a user, or a job submission. Our system uses software wattmeters, and we validate our setup with accurate per-node external wattmeters. Results show considerable potential for efficiency gains, providing arguments for building user engagement through energy monitoring.

Paper Structure (4 sections, 2 figures, 1 table)

This paper contains 4 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Global view of our setup to profile CPU and GPU usage with software and external power meters. Usage and power draw are attributed to each PID corresponding to each job launched by SLURM.
  • Figure 2: Histograms of GPU SM core usage and GPU memory usage. Two peaks are present in the distribution, but none of the jobs uses the GPUs at full capacity.