Dispatching Odyssey: Exploring Performance in Computing Clusters under Real-world Workloads

Mert Yildiz, Alexey Rolich, Andrea Baiocchi

TL;DR

This work analyzes dispatching policies in multi-server clusters using real Google Borg workload traces to reveal how policy choice and system architecture shape response times under heavy-tailed workloads. Through data-driven simulations and a compact G/G/M analytical framework, it compares Round Robin, Least Work Left, and Join Idle Queue policies at both job and task levels, and demonstrates that simple policies can outperform more complex ones when paired with astute architectural design, such as a two-stage dispatching system. Key findings show that LWL and JIQ perform best at the job level only up to an optimal degree of parallelism, while at the task level JIQ can outperform LWL due to task-level correlations and monster jobs; the two-stage architecture can significantly reduce mean response time, especially with modest server counts. The results underscore the importance of workload structure in performance modeling and provide design directions for scalable, real-world data-center scheduling that balances simplicity and effectiveness.

Abstract

Recent workload measurements in Google data centers provide an opportunity to challenge existing models and, more broadly, to enhance the understanding of dispatching policies in computing clusters. Through extensive data-driven simulations, we aim to highlight the key features of workload traffic traces that influence response time performance under simple yet representative dispatching policies. For a given computational power budget, we vary the cluster size, i.e., the number of available servers. A job-level analysis reveals that Join Idle Queue (JIQ) and Least Work Left (LWL) exhibit an optimal working point for a fixed utilization coefficient as the number of servers is varied, whereas Round Robin (RR) demonstrates monotonically worsening performance. Additionally, we explore the accuracy of simple G/G queue approximations. When decomposing jobs into tasks, interesting results emerge; notably, the simpler, non-size-based policy JIQ appears to outperform the more "powerful" size-based LWL policy. Complementing these findings, we present preliminary results on a two-stage scheduling approach that partitions tasks based on service thresholds, illustrating that modest architectural modifications can further enhance performance under realistic workload conditions. We provide insights into these results and suggest promising directions for fully explaining the observed phenomena.
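
To make the three policies named in the abstract concrete, the following is a minimal sketch of a trace-driven dispatcher, not the authors' simulator. It assumes each server serves its own queue in FCFS order, that all servers share the same speed, and that JIQ falls back to a uniformly random server when none is idle. The function name `simulate` and its parameters are hypothetical and chosen only for illustration.

```python
import random

def simulate(policy, arrivals, sizes, n, speed=1.0, seed=0):
    """Return the mean response time of n FCFS servers under one policy.

    arrivals: non-decreasing arrival times; sizes: work brought by each job;
    speed: work processed per unit time by each server (assumed identical).
    """
    free_at = [0.0] * n          # time at which each server clears its backlog
    rr_next = 0                  # next server in the Round Robin cycle
    rng = random.Random(seed)    # used only by JIQ
    total = 0.0
    for t, s in zip(arrivals, sizes):
        if policy == "RR":
            # Round Robin: cycle through the servers regardless of state.
            k = rr_next
            rr_next = (rr_next + 1) % n
        elif policy == "LWL":
            # Least Work Left: pick the server with the smallest backlog.
            k = min(range(n), key=lambda i: max(free_at[i] - t, 0.0))
        elif policy == "JIQ":
            # Join Idle Queue: pick an idle server if one exists,
            # otherwise (assumption) a uniformly random server.
            idle = [i for i in range(n) if free_at[i] <= t]
            k = rng.choice(idle) if idle else rng.randrange(n)
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(free_at[k], t)        # wait behind earlier jobs (FCFS)
        free_at[k] = start + s / speed    # departure time of this job
        total += free_at[k] - t           # response time = departure - arrival
    return total / len(sizes)

if __name__ == "__main__":
    # Toy input: one large job followed by two small ones, two servers.
    arrivals, sizes = [0.0, 0.0, 0.0], [10.0, 1.0, 1.0]
    for policy in ("RR", "LWL", "JIQ"):
        print(policy, simulate(policy, arrivals, sizes, n=2))
```

To mimic a fixed computational power budget while varying the cluster size, one option is to pass `speed=C / n` for a total capacity `C`, so that adding servers makes each one proportionally slower. This scaling is an assumption about how the budget is split, not a detail stated in the abstract.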

Paper Structure

This paper contains 11 sections, 10 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Mean response time of a cluster of $n$ servers as a function of $n$. Lines are computed by using analytical models, and markers with the dashed lines correspond to data-driven simulations based on the workload of day 4 of the Google trace of cell c. The utilization coefficient of servers is $\rho_0 = 0.8$. Note: The data used includes original values with outliers and is not shuffled.
  • Figure 2: Decomposition of original data and comparative analysis of mean response time of a cluster of $n$ servers as a function of $n$ under different conditions.
  • Figure 3: Mean response time of a cluster of $n$ servers as a function of $n$. Lines are computed by using analytical models, and markers with the dashed lines correspond to data-driven simulations based on the workload of day 4 of the Google trace of cell c. The utilization coefficient of servers is $\rho_0 = 0.8$. Note: The data used does not include outliers but both IAT and CPU values are kept unchanged.
  • Figure 4: Mean response time of a cluster of $n$ servers as a function of $n$. Lines are computed by using analytical models, and markers with the dashed lines correspond to data-driven simulations based on the workload of day 4 of the Google trace of cell c. The utilization coefficient of servers is $\rho_0 = 0.8$. Note: The data used does not include outliers and both IAT and CPU values are shuffled.
  • Figure 5: Mean response time of dispatching algorithms for varying server counts over 31 days in May, shown as boxplots with the yellow line indicating the median of the mean response time. The boxplots visualize the spread.
  • ...and 5 more figures