CarbonFlex: Enabling Carbon-aware Provisioning and Scheduling for Cloud Clusters
Walid A. Hanafy, Li Wu, David Irwin, Prashant Shenoy
TL;DR
CarbonFlex is a carbon-aware resource manager for cloud clusters that jointly optimizes the provisioning and scheduling of elastic batch jobs using continuous learning from historical traces. It treats provisioning and scheduling as separate decisions, applies elastic scaling to both, and learns from an offline oracle to guide runtime decisions. Across CPU and GPU workloads and multiple locations, it reduces carbon emissions by about $57\%$ relative to a carbon-agnostic baseline and stays within $2.1\%$ of the oracle. The approach is implemented on AWS ParallelCluster and evaluated with real-world traces, where it outperforms state-of-the-art baselines such as GAIA, WaitAwhile, and CarbonScaler while maintaining SLOs. The work demonstrates how learning from historical data can enable carbon-efficient operation of cloud clusters and offers a practical pathway for integrating carbon-aware provisioning with existing schedulers and provisioning strategies.
Abstract
Accelerating computing demand, largely from AI applications, has led to concerns about its carbon footprint. Fortunately, a significant fraction of computing demand comes from batch jobs that are often delay-tolerant and elastic, which enables schedulers to reduce carbon by suspending/resuming jobs and scaling their resources down/up when carbon is high/low. However, prior work on carbon-aware scheduling generally focuses on optimizing carbon for individual jobs in the cloud, and not provisioning and scheduling resources for many parallel jobs in cloud clusters. To address the problem, we present CarbonFlex, a carbon-aware resource provisioning and scheduling approach for cloud clusters. CarbonFlex leverages continuous learning over historical cluster-level data to drive near-optimal runtime resource provisioning and job scheduling. We implement CarbonFlex by extending AWS ParallelCluster to include our carbon-aware provisioning and scheduling algorithms. Our evaluation on publicly available industry workloads shows that CarbonFlex decreases carbon emissions by $\sim 57\%$ compared to a carbon-agnostic baseline and performs within $2.1\%$ of an oracle scheduler with perfect knowledge of future carbon intensity and job length.
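To make the suspend/resume idea concrete, the following is a minimal sketch for a single delay-tolerant job. It is not the CarbonFlex algorithm. It assumes perfect knowledge of hourly carbon intensity, a constant power draw, and a job that can be paused at slot boundaries at no cost. The function names, the `power_kw` parameter, and the intensity values are hypothetical and chosen only for illustration.

```python
from typing import List


def carbon_agnostic_emissions(intensity: List[float], length: int,
                              power_kw: float = 1.0) -> float:
    """Emissions (gCO2) when the job runs immediately for `length` hourly slots.

    `intensity` holds carbon intensity per hourly slot in gCO2/kWh.
    """
    if length > len(intensity):
        raise ValueError("job is longer than the intensity trace")
    return power_kw * sum(intensity[:length])


def suspend_resume_emissions(intensity: List[float], length: int,
                             deadline: int, power_kw: float = 1.0) -> float:
    """Emissions (gCO2) when the job runs only in the `length` lowest-carbon
    slots before `deadline`, and is suspended in all other slots.

    Assumes the future intensity is known and suspension is free.
    """
    window = intensity[:deadline]
    if length > len(window):
        raise ValueError("job cannot finish before the deadline")
    return power_kw * sum(sorted(window)[:length])


# Hypothetical hourly carbon intensity trace (gCO2/kWh).
trace = [300.0, 250.0, 100.0, 120.0, 400.0, 90.0]

baseline = carbon_agnostic_emissions(trace, length=2)
shifted = suspend_resume_emissions(trace, length=2, deadline=6)
savings = 1.0 - shifted / baseline
print(f"baseline={baseline:.0f} g, shifted={shifted:.0f} g, savings={savings:.1%}")
```

With this toy trace, the job waits for the two cleanest hours instead of starting right away. Handling many jobs that compete for a shared, elastically provisioned cluster, without knowing future intensity or job lengths, is the harder problem the abstract describes.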
