Cluster-Wide Task Slowdown Detection in Cloud System

Feiyi Chen, Yingying Zhang, Lunting Fan, Yuxuan Liang, Guansong Pang, Qingsong Wen, Shuiguang Deng

TL;DR

The paper tackles cluster-wide slowdown detection in cloud systems by shifting from per-task monitoring to the distribution of task durations across the cluster, enabling computation that is independent of the number of tasks. It introduces SORN, a three-part framework: Skimming Attention to capture compound periodicity, Neural Optimal Transport to align reconstructed fluctuations with non-slowing behavior, and a Picky Loss to mitigate training-time anomaly contamination, culminating in a specialized anomaly score based on distributional shifts. Empirical results on four real-world industrial datasets show SORN outperforms state-of-the-art baselines in F1, with favorable time and memory overhead and robustness to noise and lax periodicity. The approach promises practical impact for real-time AIOps in cloud centers by reliably detecting cluster-wide slowdowns while staying scalable and robust.

Abstract

Slow task detection is a critical problem in cloud operation and maintenance because it directly affects user experience and can incur substantial liquidated damages. Most anomaly detection methods approach it from a single-task perspective. However, with millions of concurrent tasks in large-scale cloud computing clusters, per-task detection becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a cluster malfunction, because task durations fluctuate violently in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by using the distribution of task duration times across a cluster, so that the computational complexity is independent of the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Although transformer-based methods are among the most powerful for capturing these normal variation patterns in time series, we empirically find and theoretically explain a flaw of the standard attention mechanism: it reconstructs low-amplitude subperiods poorly when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in practice, we propose a picky loss function that adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets.
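As a rough illustration of the shift from per-task monitoring to a cluster-wide duration distribution, the sketch below bins task durations per time slot into a fixed-length frequency vector. It is not the authors' preprocessing: the function name `duration_distributions`, the `(finish_timestamp, duration)` input format, the slot length, and the bin edges are all assumptions made for this example.

```python
from bisect import bisect_right
from collections import defaultdict


def duration_distributions(tasks, slot_seconds, bin_edges):
    """Turn per-task records into one duration histogram per time slot.

    tasks: iterable of (finish_timestamp, duration) pairs (hypothetical format).
    bin_edges: sorted duration thresholds; they define len(bin_edges) + 1 bins.
    Returns {slot_index: [fraction of tasks falling in each duration bin]}.
    """
    counts = defaultdict(lambda: [0] * (len(bin_edges) + 1))
    for finish_ts, duration in tasks:
        slot = int(finish_ts // slot_seconds)
        # bisect_right returns the index of the duration interval
        counts[slot][bisect_right(bin_edges, duration)] += 1

    distributions = {}
    for slot, slot_counts in counts.items():
        total = sum(slot_counts)
        distributions[slot] = [c / total for c in slot_counts]
    return distributions


# Five made-up tasks, 60-second slots, bins: <1s, [1,5)s, [5,30)s, >=30s
example_tasks = [(10, 0.5), (20, 2.0), (50, 2.0), (70, 40.0), (110, 0.2)]
example = duration_distributions(example_tasks, 60, [1.0, 5.0, 30.0])
```

Each slot yields a vector whose length is set by the number of bins, so whatever model consumes these vectors sees the same input size whether the slot contained five tasks or five million. The aggregation pass itself still touches every task once.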

Paper Structure

This paper contains 23 sections, 12 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: (a) At each time slot, a stacked histogram bar shows the frequency distribution of task duration times at that slot; darker colors denote intervals with longer duration times. The bars are arranged in chronological order. (b) The compound periodicity of task duration time. (c) The original series and the series reconstructed by standard attention, plotted together; the low-amplitude subperiod is not reconstructed well.
  • Figure 2: The model architecture of the proposed SORN algorithm.
  • Figure 3: (a) The figure shows different amplitudes of different subperiods; (b) The figure shows attention weight along different subperiods in $f(t)$. The width of the shadow is the value of the attention weight divided by 100 at the corresponding time slot. To distinguish the positive attention weight and negative attention weight we plot them in different colors and denote them by $\mathcal{A}^+$ and $\mathcal{A}^-$ respectively. (c) & (d) The visualization of SORN.
  • Figure 4: (a) The distribution of the autocorrelation coefficient at a lag equal to the period length, for the subsets of every dataset. (b) The time and memory overhead of SORN and the baselines on the Sync dataset. Each method is labeled by the first two characters of its name. (c) The hyperparameter sensitivity to the number of skimming layers and the patch size on the Sync dataset.
  • Figure 5: (a) We add noise to the original synthetic time series; the standard deviation of the noise is the maximum amplitude of the original time series multiplied by the "noise" value shown in the legend. We then test the performance of SORN for different slow task ratios. (b) We stretch each period of a periodic time series by a scale factor randomly sampled from $(1,1+R]$, so that the resulting time series has a lax periodicity. We then test the performance of SORN for different slow task ratios. (c) Using the same noise setting as (a), we test the performance of SORN for different average slow-down times. (d) Using the same period setting as (b), we test the performance of SORN for different average slow-down times.
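
The perturbations described for Figure 5(a) and 5(b), and the period-lag autocorrelation of Figure 4(a), can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' experiment code. The captions do not specify the noise distribution or how a period is stretched, so Gaussian noise, nearest-neighbour resampling, reading "maximum amplitude" as the largest absolute value, and all function names here are assumptions.

```python
import math
import random


def compound_series(n, periods_amps):
    """Sum of sinusoids, one (period, amplitude) pair per subperiod."""
    return [sum(a * math.sin(2 * math.pi * t / p) for p, a in periods_amps)
            for t in range(n)]


def add_noise(series, noise, rng):
    """Add zero-mean noise with std = noise * max |value| (Gaussian is assumed)."""
    sigma = noise * max(abs(x) for x in series)
    return [x + rng.gauss(0.0, sigma) for x in series]


def lax_periods(series, period, R, rng):
    """Stretch each period by a factor drawn from (1, 1 + R]."""
    out = []
    for start in range(0, len(series), period):
        chunk = series[start:start + period]
        # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]
        factor = 1.0 + R * (1.0 - rng.random())
        new_len = max(len(chunk), round(len(chunk) * factor))
        # nearest-neighbour resampling of the chunk to its stretched length
        out.extend(chunk[min(len(chunk) - 1, int(i * len(chunk) / new_len))]
                   for i in range(new_len))
    return out


def autocorr(series, lag):
    """Sample autocorrelation coefficient at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


rng = random.Random(0)
clean = compound_series(240, [(24, 1.0), (6, 0.2)])  # low-amplitude subperiod of 6
noisy = add_noise(clean, 0.1, rng)
stretched = lax_periods(clean, 24, 0.5, rng)
period_lag_acf = autocorr(clean, 24)
```

On the clean series the autocorrelation at lag 24 is close to 1. Comparing it with `autocorr(stretched, 24)` gives a quick sense of how much the stretching loosens the periodicity.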