Eva: Cost-Efficient Cloud-Based Cluster Scheduling

Tzu-Tao Chang, Shivaram Venkataraman

TL;DR

This work tackles cost-efficient hosting of batch jobs on cloud-based clusters by jointly optimizing task placement and instance provisioning. It introduces Eva, a reservation-price–based scheduler that accounts for co-location interference and migration overhead via Full and Partial Reconfiguration, implemented in a modular master-worker system with a simulator. Empirical results across physical AWS experiments and large-scale Alibaba traces show substantial cost reductions (up to 42%) with modest increases in JCT (around 15%), demonstrating that coordinated packing and provisioning can outperform isolated, per-task provisioning. The approach offers practical benefits for cloud data centers hosting heterogeneous workloads, providing a scalable, interference-aware framework for dynamic cluster reconfiguration.

Abstract

Cloud computing offers flexibility in resource provisioning, allowing an organization to host its batch processing workloads cost-efficiently by dynamically scaling the size and composition of a cloud-based cluster (a collection of instances provisioned from the cloud). However, existing schedulers fail to minimize total cost due to suboptimal task and instance scheduling strategies, interference between co-located tasks, and instance provisioning overheads. We present Eva, a scheduler for cloud-based clusters that reduces the overall cost of hosting long-running batch jobs. Eva leverages reservation price from economics to derive the optimal set of instances to provision and task-to-instance assignments. Eva also takes into account performance degradation when co-locating tasks and quantitatively evaluates the trade-off between short-term migration overhead and long-term provision savings when considering a change in cluster configuration. Experiments on AWS EC2 and large-scale trace-driven simulations demonstrate that Eva reduces costs by 42% while incurring only a 15% increase in job completion time (JCT), compared to provisioning a separate instance for each task.
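The trade-off between short-term migration overhead and long-term provisioning savings can be illustrated with a simple break-even check. This is a minimal sketch under assumed inputs, not Eva's actual decision rule: the function name `should_reconfigure`, the hourly cost rates, the one-off migration cost, and the expected remaining duration are all hypothetical quantities introduced here for illustration.

```python
def should_reconfigure(
    current_cost_per_hour: float,
    new_cost_per_hour: float,
    migration_cost: float,
    expected_hours: float,
) -> bool:
    """Illustrative break-even test (an assumption, not the paper's formula).

    Reconfigure only if the provisioning savings accumulated over the
    expected remaining duration exceed the one-off cost of migrating tasks.
    """
    # Long-term savings: hourly cost difference times how long it is expected to last.
    savings = (current_cost_per_hour - new_cost_per_hour) * expected_hours
    # Short-term overhead: a single up-front cost paid when tasks are moved.
    return savings > migration_cost


# Hypothetical numbers: the new configuration saves 3 per hour and migration costs 5.
worth_it_long = should_reconfigure(10.0, 7.0, 5.0, expected_hours=2.0)
worth_it_short = should_reconfigure(10.0, 7.0, 5.0, expected_hours=1.0)
```

With these made-up numbers, the change pays off when the configuration is expected to last two hours but not when it lasts only one, which is the intuition behind weighing migration overhead against provisioning savings.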

Paper Structure

This paper contains 35 sections, 3 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: Performance of batch jobs when co-located on the same instance. Each cell shows the normalized throughput of Workload 1 when co-located with Workload 2. Both workloads receive the resources they requested, as listed in Table \ref{tab:evaluate-workload}, and are assigned to separate GPUs and CPUs on the same instance. The jobs start simultaneously and run for 10 minutes. Throughput is measured for each job during this period and normalized by dividing it by the job's standalone throughput on an instance without co-location.
  • Figure 2: Eva architecture.
  • Figure 3: Instance uptimes with 120 jobs.
  • Figure 4: Impact of co-location interference.
  • Figure 5: Impact of migration overhead. $2 \times$ means each job's migration delay is set to twice its original delay duration.
  • ...and 3 more figures
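The normalization described in the Figure 1 caption (co-located throughput divided by standalone throughput) can be written out as a short sketch. The function name and the sample throughput values are hypothetical and are not taken from the paper's measurements.

```python
def normalized_throughput(colocated: float, standalone: float) -> float:
    """Throughput under co-location divided by the job's standalone throughput.

    A value of 1.0 means no slowdown; values below 1.0 indicate interference.
    """
    if standalone <= 0:
        raise ValueError("standalone throughput must be positive")
    return colocated / standalone


# Hypothetical example: a job that processes 100 samples/s alone
# and 80 samples/s when sharing an instance with another workload.
ratio = normalized_throughput(colocated=80.0, standalone=100.0)
```

In this made-up case the ratio is 0.8, meaning the job retains 80% of its standalone throughput when co-located.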