CarbonFlex: Enabling Carbon-aware Provisioning and Scheduling for Cloud Clusters

Walid A. Hanafy, Li Wu, David Irwin, Prashant Shenoy

TL;DR

CarbonFlex introduces a carbon-aware resource manager for cloud clusters that jointly optimizes provisioning and scheduling of elastic batch jobs using continuous learning from historical traces. It separates provisioning from scheduling, applies elastic scaling to both tasks, and learns from an offline oracle to guide runtime decisions, achieving up to a $57\%$ reduction in carbon emissions and $<2.1\%$ deviation from the oracle across CPU and GPU workloads and multiple locations. The approach is implemented on AWS ParallelCluster and evaluated with real-world traces, where it outperforms state-of-the-art baselines such as GAIA, WaitAwhile, and CarbonScaler while maintaining SLOs. The work demonstrates how learning from historical data can enable carbon-efficient operation of cloud clusters and offers a practical path for integrating carbon-aware provisioning with existing schedulers and provisioning strategies.

Abstract

Accelerating computing demand, largely from AI applications, has led to concerns about its carbon footprint. Fortunately, a significant fraction of computing demand comes from batch jobs that are often delay-tolerant and elastic, which enables schedulers to reduce carbon by suspending/resuming jobs and scaling their resources down/up when carbon is high/low. However, prior work on carbon-aware scheduling generally focuses on optimizing carbon for individual jobs in the cloud, and not provisioning and scheduling resources for many parallel jobs in cloud clusters. To address the problem, we present CarbonFlex, a carbon-aware resource provisioning and scheduling approach for cloud clusters. CarbonFlex leverages continuous learning over historical cluster-level data to drive near-optimal runtime resource provisioning and job scheduling. We implement CarbonFlex by extending AWS ParallelCluster to include our carbon-aware provisioning and scheduling algorithms. Our evaluation on publicly available industry workloads shows that CarbonFlex decreases carbon emissions by $\sim$57\% compared to a carbon-agnostic baseline and performs within 2.1\% of an oracle scheduler with perfect knowledge of future carbon intensity and job length.
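
To make the suspend/resume idea in the abstract concrete, the following is a minimal sketch for a single delay-tolerant job. It is not CarbonFlex itself: the function names, the hourly intensity values, and the constant 1 kW power draw are illustrative assumptions, and the carbon-aware policy assumes a perfect forecast of carbon intensity over the window.

```python
from typing import List


def carbon_agnostic_slots(length_hours: int) -> List[int]:
    """Baseline: start immediately and run without interruption."""
    return list(range(length_hours))


def suspend_resume_slots(intensity: List[float], length_hours: int) -> List[int]:
    """Run only in the `length_hours` lowest-carbon slots of the window.

    Assumes a perfect forecast; the job is suspended in every other slot.
    """
    if length_hours > len(intensity):
        raise ValueError("job does not fit in the scheduling window")
    # Rank slots by intensity, breaking ties in favor of earlier slots.
    ranked = sorted(range(len(intensity)), key=lambda t: (intensity[t], t))
    return sorted(ranked[:length_hours])


def emissions(intensity: List[float], slots: List[int], power_kw: float = 1.0) -> float:
    """Emissions in gCO2 for hourly slots, with intensity in gCO2/kWh."""
    return sum(intensity[t] * power_kw for t in slots)


# Hypothetical hourly carbon intensity (gCO2/kWh) over a six-hour window.
intensity = [300.0, 280.0, 120.0, 100.0, 250.0, 310.0]
baseline = carbon_agnostic_slots(3)
aware = suspend_resume_slots(intensity, 3)
baseline_g = emissions(intensity, baseline)
aware_g = emissions(intensity, aware)
```

In this toy window the carbon-aware schedule waits two hours and then runs through the low-carbon period. The cluster-level problem the paper addresses is harder because many jobs compete for the same provisioned capacity.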

Paper Structure

This paper contains 22 sections, 1 theorem, 1 equation, 14 figures, 3 tables, 3 algorithms.

Key Result

theorem 1

The offline algorithm (alg:offline) yields optimal carbon savings for homogeneous clusters and monotonically decreasing marginal throughput profiles.
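
The role of the monotonicity condition can be illustrated with a standard greedy allocation. When each job's marginal throughput per additional server is non-increasing, handing out identical servers one at a time to the job with the largest next marginal gain maximizes total throughput. The sketch below is an assumption-laden illustration of that general principle, not the paper's offline algorithm; `greedy_allocate` and the two profiles are hypothetical.

```python
import heapq
from typing import Dict, List


def greedy_allocate(marginal: Dict[str, List[float]], servers: int) -> Dict[str, int]:
    """Assign identical servers one at a time to the job with the best next gain.

    marginal[job][k] is the extra throughput from that job's (k+1)-th server
    and is assumed to be non-increasing in k (diminishing returns).
    """
    alloc = {job: 0 for job in marginal}
    # Max-heap via negated gains; each entry is a job's next marginal gain.
    heap = [(-gains[0], job) for job, gains in marginal.items() if gains]
    heapq.heapify(heap)
    for _ in range(servers):
        if not heap:
            break  # every job is already at its maximum scale
        _, job = heapq.heappop(heap)
        alloc[job] += 1
        k = alloc[job]
        if k < len(marginal[job]):
            heapq.heappush(heap, (-marginal[job][k], job))
    return alloc


# Hypothetical scaling profiles: one job scales well, the other saturates quickly.
profiles = {
    "mpi": [1.0, 0.9, 0.8, 0.7],
    "ml": [1.0, 0.6, 0.3, 0.1],
}
alloc = greedy_allocate(profiles, 4)
total = sum(sum(profiles[job][:n]) for job, n in alloc.items())
```

With four servers, the well-scaling job receives three and the saturating job one. If a profile were not monotonically decreasing, a locally small gain could hide a larger later one, and this greedy choice would no longer be guaranteed optimal.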

Figures (14)

  • Figure 1: Carbon Intensity Variations in four locations in the first week of April 2022.
  • Figure 2: Elastic scaling profiles of different MPI and machine learning jobs that depict the marginal increase in throughput for each additional server.
  • Figure 3: Overview of the learning and execution phases of CarbonFlex.
  • Figure 4: Representing the decisions made by CarbonFlex(Oracle) as a provisioning and scheduling policy.
  • Figure 5: Diversity in selected Carbon Intensity traces.
  • ...and 9 more figures

Theorems & Definitions (2)

  • theorem 1
  • proof