Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Ricardo Bianchini

TL;DR

AI datacenters face high total cost of ownership (TCO) because of rapid LLM-driven demand growth and dense accelerator hardware. The authors propose a $\mathrm{TCO} = \mathrm{CapEx} + \mathrm{OpEx}$ framework spanning build, IT provisioning, and operation, and demonstrate cross-stage optimization that reduces total cost by up to 40%. By modeling workload growth, hardware roadmaps, and aging, the approach shows stage-specific gains (roughly 15%, 23%, and 19%) and a holistic strategy that outperforms traditional lifecycles. The work provides lifecycle-aware design guidelines, including flatter power delivery, hybrid cooling, hierarchical networking, flexible refresh policies, and heterogeneity-aware scheduling, to enable scalable, cost-efficient AI datacenters.
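
The TCO relation in the summary can be illustrated with a minimal sketch. It only shows the arithmetic of summing CapEx and OpEx and comparing two lifecycle plans; it is not the authors' framework. The class `LifecycleCost`, the function `relative_saving`, and all numeric values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecycleCost:
    """Cost of one lifecycle plan in arbitrary currency units (hypothetical)."""
    capex: float  # capital expenses: build-out and IT hardware purchases
    opex: float   # operational expenses over the planning horizon

    @property
    def tco(self) -> float:
        # TCO = CapEx + OpEx
        return self.capex + self.opex


def relative_saving(baseline: LifecycleCost, candidate: LifecycleCost) -> float:
    """Return the fraction of the baseline TCO saved by the candidate plan."""
    if baseline.tco <= 0:
        raise ValueError("baseline TCO must be positive")
    return (baseline.tco - candidate.tco) / baseline.tco


# Placeholder numbers that only exercise the arithmetic; they are NOT from the paper.
baseline = LifecycleCost(capex=600.0, opex=400.0)
candidate = LifecycleCost(capex=450.0, opex=350.0)
print(f"saving: {relative_saving(baseline, candidate):.0%}")
```

Expressing the saving relative to the baseline TCO matches how the summary reports its percentages, but how the paper attributes savings to individual stages is not specified here, so the sketch makes no attempt to combine per-stage gains.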

Abstract

The rapid rise of large language models (LLMs) has been driving an enormous demand for AI inference infrastructure, mainly powered by high-end GPUs. While these accelerators offer immense computational power, they incur high capital and operational costs due to frequent upgrades, dense power consumption, and cooling demands, making total cost of ownership (TCO) for AI datacenters a critical concern for cloud providers. Unfortunately, traditional datacenter lifecycle management (designed for general-purpose workloads) struggles to keep pace with AI's fast-evolving models, rising resource needs, and diverse hardware profiles. In this paper, we rethink the AI datacenter lifecycle scheme across three stages: building, hardware refresh, and operation. We show how design choices in power, cooling, and networking provisioning impact long-term TCO. We also explore refresh strategies aligned with hardware trends. Finally, we use operation-stage software optimizations to reduce cost. While the optimizations at each stage yield benefits, unlocking the full potential requires rethinking the entire lifecycle. Thus, we present a holistic lifecycle management framework that coordinates and co-optimizes decisions across all three stages, accounting for workload dynamics, hardware evolution, and system aging. Our system reduces the TCO by up to 40% over traditional approaches. Using our framework, we provide guidelines on how to manage the AI datacenter lifecycle in the future.

Paper Structure

This paper contains 29 sections, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Hosting AI workloads from models to hardware and supporting datacenter infrastructure.
  • Figure 2: The P50, P99, and average size of the most popular AI models published in the last decade.
  • Figure 3: TCO breakdown for a 10 MW AI datacenter.
  • Figure 4: Server count by GPU type over time in an AI fleet following the traditional baseline in Table \ref{table:traditionalDCLifecycle}. Includes the release dates of major AI models.
  • Figure 5: Per-server TDP across Intel and AMD CPUs, and NVIDIA and AMD GPUs over the years.
  • ...and 10 more figures