Mitigating Interference of Microservices with a Scoring Mechanism in Large-scale Clusters

Dingyu Yang, Kangpeng Zheng, Shiyou Qian, Jian Cao, Guangtao Xue

TL;DR

This paper addresses interference in large-scale clusters where latency-critical services (LCSs) share servers with best-effort jobs (BEJs). It introduces PISM, a data-driven, proactive framework that characterizes BEJs, models the impact of BEJ compositions on LCS response times, and uses an interference scoring mechanism to guide BEJ scheduling. Key contributions include BEJ task characterization with a 60-category BE task encoding, BEJ similarity clustering via a graph kernel, two interference scoring models (one for LCSs, one for servers) with k=10 levels, and interference prediction based on the BEJ composition of each server. Evaluations on large traces and a 14-node cluster show that PISM can reduce interference by up to 41.5% and improve the throughput of long-tail LCSs by up to 76.4%, while remaining integrable with common schedulers and scalable to production environments.

Abstract

Co-locating latency-critical services (LCSs) and best-effort jobs (BEJs) constitutes the principal approach for enhancing resource utilization in production. Nevertheless, the co-location practice hurts the performance of LCSs due to resource competition, even when isolation technology is employed. Through an extensive analysis of large volumes of real trace data from two production clusters, we observe that BEJs typically exhibit periodic execution patterns and serve as the primary sources of interference to LCSs. Furthermore, even at the same level of resource consumption, different compositions of BEJs can result in varying degrees of interference on LCSs. Based on these observations, we propose PISM, a proactive Performance Interference Scoring and Mitigating framework for LCSs through the optimization of BEJ scheduling. Firstly, PISM adopts a data-driven approach to establish a characterization and classification methodology for BEJs. Secondly, PISM models the relationship between the composition of BEJs on servers and the response time (RT) of LCSs. Thirdly, PISM establishes an interference scoring mechanism in terms of RT, which serves as the foundation for BEJ scheduling. We assess the effectiveness of PISM on a small-scale cluster and through extensive data-driven simulations. The experimental results demonstrate that PISM can reduce cluster interference by up to 41.5% and improve the throughput of long-tail LCSs by 76.4%.
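
To make the idea of an RT-based interference score guiding BEJ placement more concrete, here is a minimal sketch. It is not the paper's scoring model: only the number of levels (k=10) comes from the summary above. The equal-width binning of the RT slowdown, the `max_slowdown` cap, and the function names `interference_level` and `pick_server` are all assumptions made for illustration.

```python
K_LEVELS = 10  # k=10 levels is from the summary; everything else is assumed


def interference_level(rt_colocated_ms: float, rt_baseline_ms: float,
                       max_slowdown: float = 2.0, k: int = K_LEVELS) -> int:
    """Map an RT slowdown ratio to a discrete level in 1..k.

    Hypothetical binning: equal-width bins between "no slowdown" (1.0x)
    and an assumed cap (max_slowdown); anything above the cap is level k.
    """
    if rt_baseline_ms <= 0:
        raise ValueError("baseline RT must be positive")
    if max_slowdown <= 1.0:
        raise ValueError("max_slowdown must be greater than 1.0")
    # Treat RTs faster than the baseline as "no interference".
    slowdown = max(rt_colocated_ms / rt_baseline_ms, 1.0)
    # Fraction of the way from 1.0x to the cap, clamped to [0, 1].
    frac = min((slowdown - 1.0) / (max_slowdown - 1.0), 1.0)
    return min(int(frac * k) + 1, k)


def pick_server(predicted_levels: dict[str, int]) -> str:
    """Choose the candidate server with the lowest predicted level.

    Ties are broken by server name so the choice is deterministic.
    """
    if not predicted_levels:
        raise ValueError("no candidate servers")
    return min(predicted_levels, key=lambda s: (predicted_levels[s], s))


if __name__ == "__main__":
    # Hypothetical predicted RTs (ms) for one LCS if a BEJ were placed on each server.
    baseline_ms = 100.0
    predicted_rt_ms = {"server-a": 150.0, "server-b": 125.0, "server-c": 300.0}
    levels = {name: interference_level(rt, baseline_ms)
              for name, rt in predicted_rt_ms.items()}
    print(levels, "->", pick_server(levels))
```

In this sketch a scheduler would first predict the LCS response time for each candidate placement, convert it to a level, and then place the BEJ on the server with the lowest level. How PISM actually predicts RT from BEJ compositions and defines its levels is described in the paper itself, not here.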

Paper Structure (34 sections, 3 equations, 18 figures, 4 tables)

Figures (18)

  • Figure 1: The task composition of an example BEJ.
  • Figure 2: CDF of CPU utilization of servers, LCSs, and BEJs.
  • Figure 3: CDF of checkout instances' RT with and without BEJs.
  • Figure 4: The number of submitted BEJs per hour for the cluster and the average CPU utilization.
  • Figure 5: CDF of coefficient of variation for two different LCSs.
  • ...and 13 more figures