Megha: Decentralized Global Fair Scheduling for Federated Clusters

Meghana Thiyyakat, Subramaniam Kalambur, Dinkar Sitaram

TL;DR

Megha addresses the challenge of scheduling in federated, heterogeneous data-center clusters by introducing a decentralized global scheduler that leverages flexible partitioning and eventual consistency. The architecture uses Local Masters for cluster control and Global Masters for high-level decisions, with a constraint-driven bit-vector matching mechanism and repartitioning to optimize placement while enforcing global fairness. Evaluation on production traces and a prototype shows Megha achieves median allocation times similar to distributed schedulers and substantially tighter tail latency, illustrating scalable, fair, low-overhead scheduling for large federated clusters. The results suggest Megha’s approach can improve resource utilization and responsiveness in real-world, large-scale data centers.

Abstract

Increasing scale and heterogeneity in data centers have led to the development of federation frameworks such as KubeFed, Hydra, and Pigeon, which federate individual data center clusters. In our work, we introduce Megha, a novel decentralized resource management framework for such federated clusters. Megha employs flexible logical partitioning of clusters to distribute its scheduling load, ensuring that the requirements of the workload are satisfied with very low scheduling overheads. Unlike other schedulers that use a tiered architecture or rely on centralized databases, it uses a distributed global scheduler that does not depend on a centralized data store and instead works with eventual consistency. Our experiments with Megha show that it can schedule tasks while taking fairness and placement constraints into account, with low resource allocation times on the order of tens of milliseconds.

Paper Structure

This paper contains 31 sections, 6 equations, 12 figures, 2 algorithms.

Figures (12)

  • Figure 1: Megha's architecture with one user queue
  • Figure 2: Simplified representation of a GM's decision making process
  • Figure 3: Distribution of allocation times reported by Megha and Sparrow
  • Figure 4: Median, mean, 90th, 99th, 99.9th, and 99.99th percentile allocation times recorded for Megha and Sparrow
  • Figure 5: Total number of worker slots required by the workload sample
  • ...and 7 more figures