Carbon- and Precedence-Aware Scheduling for Data Processing Clusters

Adam Lechowicz, Rohan Shenoy, Noman Bashir, Mohammad Hajiesmaili, Adam Wierman, Christina Delimitrou

TL;DR

This work tackles the carbon footprint of data processing by integrating time-varying carbon intensity with the precedence-driven structure of DAG-based workloads. It introduces PCAPS, a carbon-aware scheduler that leverages task importance within a probabilistic DAG scheduling framework, and CAP, a provisioning-based wrapper compatible with any underlying scheduler. The authors provide theoretical analyses, including the Carbon Stretch Factor and carbon-savings bounds, and validate the approach through both a 100-node Spark-on-Kubernetes prototype and a high-fidelity Spark simulator, showing substantial carbon reductions with only a modest impact on end-to-end throughput. The results indicate that coupling carbon-aware decisions with DAG structure meaningfully reduces carbon footprint in realistic grid environments, offering a practical path toward greener data-processing clusters. Overall, PCAPS demonstrates a principled, tunable balance between carbon reduction and job completion time, while CAP offers an easier-to-implement option with strong, broadly applicable benefits.

Abstract

As large-scale data processing workloads continue to grow, their carbon footprint raises concerns. Prior research on carbon-aware schedulers has focused on shifting computation to align with availability of low-carbon energy, but these approaches assume that each task can be executed independently. In contrast, data processing jobs have precedence constraints (i.e., outputs of one task are inputs for another) that complicate decisions, since delaying an upstream "bottleneck" task to a low-carbon period will also block downstream tasks, impacting the entire job's completion time. In this paper, we show that carbon-aware scheduling for data processing benefits from knowledge of both time-varying carbon and precedence constraints. Our main contribution is PCAPS, a carbon-aware scheduler that interfaces with modern ML scheduling policies to explicitly consider the precedence-driven importance of each task in addition to carbon. To illustrate the gains due to fine-grained task information, we also study CAP, a wrapper for any carbon-agnostic scheduler that adapts the key provisioning ideas of PCAPS. Our schedulers enable a configurable priority between carbon reduction and job completion time, and we give analytical results characterizing the trade-off between the two. Furthermore, our Spark prototype on a 100-node Kubernetes cluster shows that a moderate configuration of PCAPS reduces carbon footprint by up to 32.9% without significantly impacting the cluster's total efficiency.

Paper Structure

This paper contains 33 sections, 6 theorems, 32 equations, 20 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.3

For time-varying carbon intensities given by ${\mathbf{c}}$, the carbon stretch factor of PCAPS is $1 + \frac{\mathcal{D}(\gamma, {\mathbf{c}}) K}{2 - \frac{1}{K}}$.
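
To get a feel for how the bound in Theorem 4.3 scales, the expression can be evaluated numerically. The sketch below is only an illustration: it treats $\mathcal{D}(\gamma, {\mathbf{c}})$ as a number supplied by the caller (its definition is not reproduced in this summary) and assumes $K$ is a positive integer, such as a machine count. The function and parameter names are hypothetical.

```python
def pcaps_stretch_bound(d_value: float, k: int) -> float:
    """Evaluate 1 + D * K / (2 - 1/K), the expression stated in Theorem 4.3.

    d_value: the value of D(gamma, c), supplied by the caller (assumption:
             this summary does not define how D is computed).
    k:       the quantity K from the theorem (assumption: a positive integer).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 + (d_value * k) / (2.0 - 1.0 / k)


if __name__ == "__main__":
    # When D is zero the expression collapses to exactly 1.
    print(pcaps_stretch_bound(0.0, 10))
    # For a fixed D, the expression grows with K.
    for k in (1, 4, 16):
        print(k, pcaps_stretch_bound(0.5, k))
```

For a fixed value of $\mathcal{D}$, the expression grows roughly linearly in $K$, since the denominator stays between 1 and 2.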

Figures (20)

  • Figure 1: Four scheduling policies for a motivating DAG and 18-hour-long carbon intensity trace (on the left hand side). Compared to a carbon-agnostic FIFO scheduler, the time-optimal approach (T-OPT) prioritizes starting the green and purple stages early to reduce completion time. A carbon-aware-optimal approach (C-OPT) with a deadline to finish the DAG within 18 hours reduces carbon emissions by 51.2%, at the expense of increasing time by 28.5% compared to FIFO. By prioritizing green and purple stages during high-carbon periods, PCAPS reduces carbon emissions by 23.1% and still completes the job 7% earlier compared to FIFO.
  • Figure 2: PCAPS interfaces with a probabilistic (PB) scheduling policy. Given a probability distribution over nodes ➊, PCAPS computes a relative importance score ➋ that is used to determine which nodes should run based on the current carbon intensity ➌ -- e.g., bottleneck nodes impeding job completion run regardless of carbon ➍, while less important nodes can be deferred for lower carbon periods ➎.
  • Figure 3: Illustrating PCAPS's carbon-awareness filter. Jobs A and B are DAGs found in TPC-H queries and Alibaba traces, respectively [TPCH:18, Alibaba:18]. Highlighted nodes explain two scheduling outcomes. In job A, the sampled node has low relative importance, so it is deferred. In contrast, job B's sampled node is a bottleneck task with high relative importance: even when the current carbon intensity is high, such tasks are scheduled to avoid increasing job completion time.
  • Figure 4: The CAP (Carbon-Aware Provisioning) module interacts directly with a cluster manager to specify the amount of resources (e.g., no. of machines) that can be used at any given time, based on a carbon intensity signal. CAP can be implemented without changes to an existing scheduling policy and/or the cluster manager.
  • Figure 5: Time-varying carbon intensity for six grids (detailed in \ref{tab:characteristics}) over 48 hours in January 2021.
  • ...and 15 more figures
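
The mechanisms described in the captions of Figures 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the relative importance score is assumed to be a node's probability divided by the largest probability, the carbon threshold and the CAP resource cap are assumed to be linear in the normalized carbon intensity, and every function and parameter name (`gamma`, `c_min`, `c_max`, `min_machines`) is hypothetical.

```python
import math
import random
from typing import Dict, Optional


def normalized_carbon(c: float, c_min: float, c_max: float) -> float:
    """Map a carbon intensity to [0, 1] using assumed lower and upper bounds."""
    if c_max <= c_min:
        return 0.0
    return min(max((c - c_min) / (c_max - c_min), 0.0), 1.0)


def relative_importance(probs: Dict[str, float]) -> Dict[str, float]:
    """Assumed score: each node's probability divided by the largest one."""
    top = max(probs.values())
    return {node: p / top for node, p in probs.items()}


def pcaps_style_filter(probs: Dict[str, float], c: float, c_min: float,
                       c_max: float, gamma: float,
                       rng: Optional[random.Random] = None) -> Optional[str]:
    """Sample a node from the probabilistic scheduler's distribution, then
    run it only if its importance clears a carbon-dependent threshold.
    Returns the node to schedule, or None if the node is deferred."""
    rng = rng or random.Random()
    nodes = list(probs)
    sampled = rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]
    # Assumed linear threshold: higher carbon means a stricter filter.
    threshold = gamma * normalized_carbon(c, c_min, c_max)
    return sampled if relative_importance(probs)[sampled] >= threshold else None


def cap_style_machine_limit(total_machines: int, c: float, c_min: float,
                            c_max: float, gamma: float,
                            min_machines: int = 1) -> int:
    """Assumed provisioning rule: shrink the usable machine count linearly as
    carbon intensity rises, never dropping below min_machines."""
    fraction = 1.0 - gamma * normalized_carbon(c, c_min, c_max)
    return max(min_machines, math.floor(total_machines * fraction))


if __name__ == "__main__":
    probs = {"bottleneck": 0.7, "mid": 0.2, "leaf": 0.1}
    print(pcaps_style_filter(probs, c=500, c_min=100, c_max=500, gamma=0.5))
    print(cap_style_machine_limit(100, c=500, c_min=100, c_max=500, gamma=0.5))
```

In this sketch, `gamma = 0` disables carbon awareness entirely, while larger values defer more low-importance nodes and cap resources more aggressively during high-carbon periods. The node with the highest probability always has relative importance 1, so it is never deferred for any `gamma` in [0, 1], which mirrors the bottleneck behavior described in Figure 3.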

Theorems & Definitions (16)

  • Definition 3.1: Carbon Stretch Factor (CSF)
  • Definition 3.2: Carbon Savings
  • Definition 4.1: Probabilistic Scheduler
  • Definition 4.2: Relative Importance
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Theorem 4.6
  • proof
  • proof
  • ...and 6 more