EcoServe: Designing Carbon-Aware AI Inference Systems

Yueying Li; Zhanqiu Hu; Esha Choukse; Rodrigo Fonseca; G. Edward Suh; Udit Gupta

TL;DR

EcoServe addresses both operational and embodied carbon in AI inference with a carbon-aware, cross-stack framework that jointly optimizes capacity planning, resource provisioning, and runtime scheduling. It introduces a fine-grained embodied carbon model for host systems and accelerators, and a 4R design (Reuse, Rightsize, Reduce, Recycle) to cut emissions. A cross-stack ILP co-design engine, informed by offline profiling and production traces across heterogeneous hardware, yields 1.4–2.2x total carbon reductions with modest performance impact and up to 47% carbon savings in end-to-end scenarios. The results indicate that substantial carbon reductions are achievable in practical deployments by jointly considering hardware heterogeneity, workload phases, and lifecycle management, pointing toward sustainable, high-performance AI infrastructure.

Abstract

The rapid increase in LLM ubiquity and scale levies unprecedented demands on computing infrastructure. These demands require not only large compute and memory resources but also significant energy, yielding large operational and embodied carbon emissions. In this work, we present three main observations based on modeling and traces from the production deployment of two Generative AI services in a major cloud service provider. First, while GPUs dominate operational carbon, host processing systems (e.g., CPUs, memory, storage) dominate embodied carbon. Second, offline batch inference accounts for a significant portion (up to 55%) of serving capacity. Third, there are different levels of heterogeneity across hardware and workloads for LLM inference. Based on these observations, we design EcoServe, a carbon-aware resource provisioning and scheduling framework for LLM serving systems. It is based on four principles: Reduce, Reuse, Rightsize, and Recycle (4R). With a cross-stack ILP formulation and design, we demonstrate that EcoServe can lower carbon emissions by up to 47% compared to performance-, energy-, and cost-optimized design points, while maintaining performance targets and SLOs.
