Carbon and Reliability-Aware Computing for Heterogeneous Data Centers

Yichao Zhang, Yubo Song, Subham Sahoo

TL;DR

This paper tackles carbon- and reliability-aware spatio-temporal workload migration across distributed data centers (DCs). It introduces a MILP framework that jointly minimizes operational and embodied carbon while accounting for server aging, heterogeneity, and backup resource provisioning to meet service-level agreements (SLAs). The embodied emissions model links the manufacturing footprint to server lifetime and utilization, and the optimization incorporates interactive and batch workloads, server dispatch with redundancy, and a linearization scheme for tractability. Numerical results on two interconnected DCs show total carbon reductions of up to 21% and SLA violations held below 1%, with an optimal server utilization of around 0.6 that balances energy efficiency and reliability. The work provides a practical, degradation-aware approach to sustainable and dependable DC operation in heterogeneous, geo-distributed environments.

Abstract

The rapid expansion of data centers (DCs) has intensified their energy use and carbon footprint, incurring a massive environmental cost of computing. While carbon-aware workload migration strategies have been examined, existing approaches often overlook reliability metrics, such as server lifetime degradation and quality of service (QoS), that substantially affect both the carbon and operational efficiency of DCs. Hence, this paper proposes a comprehensive optimization framework for spatio-temporal workload migration across distributed DCs that jointly minimizes operational and embodied carbon emissions while complying with service-level agreements (SLAs). A key contribution is an embodied carbon emission model based on an analysis of servers' expected lifetime, which explicitly considers the server heterogeneity that results from aging and utilization conditions. This heterogeneity is accommodated using new server dispatch strategies and a backup resource allocation model that accounts for hardware, software, and workload-induced failures. The overall model is formulated as a mixed-integer optimization problem, with multiple linearization techniques to ensure computational tractability. Numerical case studies demonstrate that the proposed method reduces total carbon emissions by up to 21%, offering a pragmatic approach to sustainable DC operations.
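
To make the idea of spatio-temporal migration concrete, the following is a minimal sketch and not the paper's MILP. It assumes a hypothetical toy setting: deferrable batch work, a fixed per-slot capacity at each DC, a known hourly grid carbon intensity, and no migration cost, embodied carbon, or reliability terms. Under those assumptions the problem is linear, so filling the lowest-intensity (DC, hour) slots first is sufficient. The function name, parameters, and numbers are all illustrative.

```python
from typing import Dict, List, Tuple


def schedule_batch(
    demand: float,
    intensity: Dict[str, List[float]],
    capacity: Dict[str, float],
    energy_per_unit: float = 1.0,
) -> Tuple[Dict[Tuple[str, int], float], float]:
    """Place deferrable batch work into the lowest-carbon (DC, hour) slots.

    demand          -- total batch workload to place (hypothetical units)
    intensity       -- grid carbon intensity per DC and hour (gCO2/kWh)
    capacity        -- workload each DC can absorb per hour (assumed constant)
    energy_per_unit -- kWh consumed per workload unit (assumed constant)
    """
    # Enumerate every (intensity, DC, hour) slot, cleanest first.
    slots = sorted(
        (ci, dc, t)
        for dc, series in intensity.items()
        for t, ci in enumerate(series)
    )
    plan: Dict[Tuple[str, int], float] = {}
    remaining = demand
    carbon = 0.0
    for ci, dc, t in slots:
        if remaining <= 0:
            break
        amount = min(remaining, capacity[dc])
        plan[(dc, t)] = amount
        carbon += amount * energy_per_unit * ci  # operational carbon only
        remaining -= amount
    if remaining > 1e-9:
        raise ValueError("demand exceeds total capacity")
    return plan, carbon


if __name__ == "__main__":
    plan, carbon = schedule_batch(
        demand=25,
        intensity={"A": [400, 100], "B": [300, 200]},
        capacity={"A": 10, "B": 10},
    )
    print(plan, carbon)
```

The framework described in the abstract goes well beyond this sketch: it adds embodied carbon, server aging, backup provisioning, and SLA constraints, which is why it needs a mixed-integer formulation rather than a greedy rule.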

Paper Structure

This paper contains 30 sections, 38 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Proposed framework for distributed DC operation, accounting for failure and server heterogeneity. $\mathrm{P^{PV}}$, $\mathrm{P^{Grid}}$, and $\mathrm{P^{Battery}}$ denote the power drawn from photovoltaics (PV), the power grid, and the battery, which together satisfy the power balance. $\mathrm{\Delta W^{B}}$ and $\mathrm{\Delta W^{I}}$ denote the migrated batch and interactive workloads, governed by \ref{eq:WI1} to \ref{eq:BI1}. After migration, $K$ server clusters are dispatched to accommodate the workload according to their heterogeneity.
  • Figure 2: Impact of server utilization rate on energy efficiency, lifetime, and failure probability, and the resulting consequences for operating economy and emissions.
  • Figure 3: Relationship between expected calendar lifetime and expected operating lifetime. The server's expected calendar lifetime $T^{\mathrm{T^{Can}}}$ consists of the past calendar operating time $T^{\mathrm{T^{PC}}}$ and the future calendar operating time $T^{\mathrm{T^{FC}}}$. $T^{\mathrm{T^{FC}}}$ is estimated from the expected future operating lifetime $T^{\mathrm{T^{FO}}}$ and the operation during the dispatched day.
  • Figure 4: Number of servers in different statuses. A server cluster includes $\mathrm{N}$ servers during the dispatched day. At time $\mathrm{t}$, $\mathrm{N^{F,H}}$ servers have failed cumulatively due to hardware failure, leaving $\mathrm{N^{R}}$ available servers. $\mathrm{N_{A}}$ servers are in active status to handle the workload or respond to failure events during operation.
  • Figure 5: Server clustering framework based on repair strategy and operating time. (a) Servers are grouped according to the expected repair and replacement strategy, since the embodied carbon emissions from the two strategies can differ by up to an order of magnitude. (b) Servers within each group are further clustered using K-means based on their accumulated operating time, enabling the identification of degradation-aware clusters. (c) The complete clustering framework combines both repair strategy and operating time to support carbon- and lifetime-aware server dispatching.
  • ...and 5 more figures
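
The two-stage clustering described in the Figure 5 caption (group by repair strategy, then cluster by accumulated operating time) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the server tuples, the strategy labels `"repair"` and `"replace"`, the function names, and the deterministic center initialization are all hypothetical, and a plain one-dimensional Lloyd's algorithm stands in for whichever K-means implementation the paper uses.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def kmeans_1d(values: List[float], k: int, iters: int = 50) -> List[int]:
    """Lloyd's algorithm on scalars with deterministic initial centers."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Initial centers: evenly spaced picks from the sorted values (assumption).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels: Optional[List[int]] = None
    for _ in range(iters):
        # Assignment step: nearest center for every value.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
    return new_labels


def cluster_servers(
    servers: List[Tuple[str, str, float]], k: int
) -> Dict[str, Tuple[str, int]]:
    """Stage (a): group by repair strategy. Stage (b): K-means on operating time."""
    groups: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for name, strategy, hours in servers:
        groups[strategy].append((name, hours))
    result: Dict[str, Tuple[str, int]] = {}
    for strategy, members in groups.items():
        # Never ask for more clusters than there are servers in the group.
        labels = kmeans_1d([h for _, h in members], min(k, len(members)))
        for (name, _), label in zip(members, labels):
            result[name] = (strategy, label)
    return result


if __name__ == "__main__":
    fleet = [
        ("s1", "repair", 1000.0),
        ("s2", "repair", 1200.0),
        ("s3", "repair", 30000.0),
        ("s4", "replace", 500.0),
        ("s5", "replace", 26000.0),
    ]
    print(cluster_servers(fleet, k=2))
```

Clustering within each strategy group, rather than across the whole fleet, keeps servers with very different embodied-carbon consequences from sharing a cluster, which matches the motivation given in panel (a) of the caption.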