EcoServe: Designing Carbon-Aware AI Inference Systems

Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G. Edward Suh, Udit Gupta

TL;DR

EcoServe addresses the dual challenge of operational and embodied carbon in AI inference by developing a carbon-aware, cross-stack framework that jointly optimizes capacity planning, resource provisioning, and runtime scheduling. It introduces a fine-grained embodied carbon model for host systems and accelerators, and a four-pronged 4R design (Reuse, Rightsize, Reduce, Recycle) to reduce emissions. A cross-stack ILP co-design engine, supported by offline profiling and production traces across heterogeneous hardware, yields 1.4–2.2x total carbon reductions with modest performance impact and up to 47% carbon savings in end-to-end scenarios. The work demonstrates that substantial carbon reductions are achievable in practical deployments by jointly considering hardware heterogeneity, workload phases, and lifecycle management, highlighting a path toward sustainable, high-performance AI infrastructure.

Abstract

The rapid increase in LLM ubiquity and scale levies unprecedented demands on computing infrastructure. These demands require not only large compute and memory resources but also significant energy, yielding large operational and embodied carbon emissions. In this work, we present three main observations based on modeling and traces from the production deployment of two Generative AI services in a major cloud service provider. First, while GPUs dominate operational carbon, host processing systems (e.g., CPUs, memory, storage) dominate embodied carbon. Second, offline batch inference accounts for a significant portion (up to 55%) of serving capacity. Third, there are different levels of heterogeneity across hardware and workloads for LLM inference. Based on these observations, we design EcoServe, a carbon-aware resource provisioning and scheduling framework for LLM serving systems. It is based on four principles: Reduce, Reuse, Rightsize, and Recycle (4R). With a cross-stack ILP formulation and design, we demonstrate that EcoServe can lower carbon emissions by up to 47% compared to performance-, energy-, and cost-optimized design points, while maintaining performance targets and SLOs.
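The split between operational and embodied carbon described above can be illustrated with a minimal accounting sketch. This is not EcoServe's model. It assumes the common convention that operational carbon is energy use times grid carbon intensity, and that embodied carbon is amortized linearly over the hardware's lifetime. The `Component` class, the function names, and every number in the example are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class Component:
    """One part of a server. All field values used below are made-up placeholders."""
    name: str
    avg_power_w: float       # assumed average power draw, in watts
    embodied_kgco2e: float   # assumed manufacturing footprint, in kg CO2e


def operational_kg(c: Component, hours: float, grid_g_per_kwh: float) -> float:
    """Operational carbon: energy consumed (kWh) times grid intensity (g CO2e/kWh)."""
    energy_kwh = c.avg_power_w * hours / 1000.0
    return energy_kwh * grid_g_per_kwh / 1000.0


def amortized_embodied_kg(c: Component, hours: float, lifetime_years: float) -> float:
    """Embodied carbon attributed to `hours` of use, amortized linearly over the lifetime."""
    return c.embodied_kgco2e * hours / (lifetime_years * HOURS_PER_YEAR)


def footprint(components, hours, grid_g_per_kwh, lifetime_years):
    """Per-component breakdown of operational and amortized embodied carbon, in kg CO2e."""
    return {
        c.name: {
            "operational": operational_kg(c, hours, grid_g_per_kwh),
            "embodied": amortized_embodied_kg(c, hours, lifetime_years),
        }
        for c in components
    }


if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the functions.
    parts = [Component("gpu", 400.0, 150.0), Component("host", 200.0, 600.0)]
    for name, kg in footprint(parts, HOURS_PER_YEAR, 100.0, 5.0).items():
        print(name, kg)
```

Under this convention, extending a component's lifetime lowers its amortized embodied share per hour of use, while lowering power draw or grid intensity lowers only the operational share.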

Paper Structure

This paper contains 30 sections, 7 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: (Left) Breakdown of thermal design power (TDP) provisioning and embodied carbon between host systems (CPU) and GPU. (Right) Through the 4R strategy, EcoServe optimizes the carbon savings for various AI workloads.
  • Figure 2: EcoServe's carbon modeling framework, with finer-grained embodied carbon estimation for memory, storage, and power-related components. Differences from ACT are highlighted in the red box.
  • Figure 3: Trends in bit density (left) and embodied carbon footprint (right) across various DRAM memory technologies for 3 different manufacturers.
  • Figure 4: Trends in embodied carbon, power, and cloud cost for different generations of GPUs. As GPU performance increases (left to right), power consumption, cost, and embodied carbon exhibit distinct trends. ACT only accounts for around 20%, shown as the blue SoC component [aws-gpu, azure-gpu, lambdalabs, CloudDeep].
  • Figure 5: Embodied carbon breakdown of full inference systems available in cloud offerings from Azure and LambdaLabs [azure-gpu, lambdalabs, CloudDeep], varying the number and type of GPU. Host-processing systems account for over half of the embodied carbon in AI systems, owing largely to memory, storage, and mainboard overheads.
  • ...and 16 more figures