FIRED: a fine-grained robust performance diagnosis framework for cloud applications

Ruyue Xin, Hongyun Liu, Peng Chen, Paola Grosso, Zhiming Zhao

TL;DR

FIRED tackles robust, fine-grained performance diagnosis for cloud applications when labeled data is scarce by integrating metrics selection, a weakly-supervised deep ensemble for anomaly detection, and a real-time root-cause localization pipeline. The framework uses a PC-based dependency graph and random-walk ranking to localize root causes with metric-level granularity, and employs a deep neural network to fuse multiple base detectors non-linearly. Evaluations on DApp monitoring data and the public SMD dataset show the deep ensemble achieves $F1$ scores above 0.8 for detection and an average accuracy above 0.7 for locating the first four root causes, with multi-minute lead times for anomaly prediction. Together, these components enable rapid, automated monitoring, recovery, and adaptive management for cloud applications in dynamic environments.
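To make the random-walk ranking idea concrete, here is a minimal sketch of a generic random walk with restart over a small metric dependency graph. It is not the paper's exact procedure: the metric names, the edge weights, the `restart` probability, and the convention that an edge points from a metric to the metrics it depends on are all assumptions chosen for illustration, and the graph is written by hand rather than learned.

```python
import numpy as np

def random_walk_rank(adjacency, names, restart=0.15, iters=200):
    """Rank nodes by visiting probability under a random walk with restart.

    adjacency[i][j] is a hypothetical weight of the edge i -> j.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    row_sums = A.sum(axis=1, keepdims=True)
    # Row-normalize into transition probabilities; a node with no
    # outgoing edge jumps uniformly to any node.
    safe = np.where(row_sums == 0, 1.0, row_sums)
    P = np.where(row_sums > 0, A / safe, 1.0 / n)
    uniform = np.full(n, 1.0 / n)
    p = uniform.copy()
    for _ in range(iters):
        # With probability `restart` jump to a uniformly chosen node,
        # otherwise follow an outgoing edge.
        p = restart * uniform + (1 - restart) * p @ P
    order = np.argsort(-p)
    return [(names[i], float(p[i])) for i in order]

# Hypothetical metrics and dependency weights (illustration only).
names = ["response_time", "cpu", "memory", "disk_io"]
adjacency = [
    [0.0, 0.9, 0.3, 0.0],  # response_time depends on cpu and memory
    [0.0, 0.0, 0.0, 0.2],  # cpu depends on disk_io
    [0.0, 0.5, 0.0, 0.0],  # memory depends on cpu
    [0.0, 0.0, 0.0, 0.0],  # disk_io has no recorded dependency
]

ranking = random_walk_rank(adjacency, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy graph the walk accumulates probability on metrics that many dependency paths lead to, so metrics deeper in the dependency chain rank higher as root-cause candidates than the symptom metric itself.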

Abstract

To run a cloud application with the required service quality, operators have to continuously monitor the application's run-time status, detect potential performance anomalies, and diagnose the root causes of those anomalies. However, existing models for performance anomaly detection often suffer from low re-usability and robustness because of the diversity of the system-level metrics being monitored and the lack of high-quality labeled monitoring data for anomalies. Moreover, current coarse-grained analysis models make it difficult to locate the system-level root causes of application performance anomalies, which hinders effective adaptation decisions. We present a FIne-grained Robust pErformance Diagnosis (FIRED) framework to tackle these challenges. The framework uses a deep neural network to build an ensemble of several well-selected base models for anomaly detection, and it adopts weakly-supervised learning because few labels are available in practice. The framework also employs a real-time fine-grained analysis model to locate the system metrics on which the anomaly depends. Our experiments show that the framework achieves the best detection accuracy and algorithm robustness among the compared methods, and that it can predict anomalies four minutes in advance with an F1 score higher than 0.8. In addition, the framework accurately localizes the first root cause and achieves an average accuracy higher than 0.7 when locating the first four root causes.
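The idea of fusing base detectors non-linearly with a neural network can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: the three "base detectors" are synthetic score generators, the fusion network is a one-hidden-layer MLP written in NumPy, and label scarcity is only mimicked by training on a small labeled subset (60 points) rather than by the paper's weakly-supervised scheme. All function names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scores(n):
    """Synthetic stand-in for the anomaly scores of three base detectors."""
    y = (rng.random(n) < 0.2).astype(float)   # about 20% anomalies
    s1 = 0.8 * y + rng.normal(0, 0.3, n)      # informative detector
    s2 = 0.5 * y + rng.normal(0, 0.3, n)      # weaker detector
    s3 = rng.normal(0, 0.3, n)                # uninformative detector
    return np.column_stack([s1, s2, s3]), y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_fusion(X, y, hidden=8, lr=0.5, epochs=500):
    """Fit a small MLP that maps base-detector scores to an anomaly probability."""
    n, d = X.shape
    W1 = rng.normal(0, 0.5, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)              # non-linear hidden layer
        p = sigmoid(H @ W2 + b2)
        eps = 1e-12
        losses.append(float(-np.mean(y * np.log(p + eps)
                                     + (1 - y) * np.log(1 - p + eps))))
        g = (p - y) / n                       # gradient of the loss w.r.t. logits
        gW2, gb2 = H.T @ g, g.sum()
        gH = np.outer(g, W2) * (1 - H ** 2)   # backpropagate through tanh
        gW1, gb1 = X.T @ gH, gH.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2

    def predict(X_new):
        return sigmoid(np.tanh(X_new @ W1 + b1) @ W2 + b2)

    return predict, losses

# Train on a small labeled subset, then score a larger held-out set.
X_small, y_small = make_scores(60)
X_test, y_test = make_scores(500)
predict, losses = train_fusion(X_small, y_small)
scores = predict(X_test)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A learned fusion layer can down-weight an uninformative detector instead of averaging it in with equal weight, which is one motivation for preferring it over fixed voting or averaging.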

Paper Structure (30 sections, 9 equations, 12 figures, 12 tables, 1 algorithm)

Figures (12)

  • Figure 1: The FIRED framework for cloud applications, which comprises model training and real-time testing. (a) Collected data and a small number of labels are used for (b) metrics selection and (c) performance anomaly detection. The trained detection model is used for (d) real-time anomaly detection. Once an anomaly is detected, (e) root cause localization starts to discover the causes of the anomaly. The localization results can be used for recovery of cloud applications.
  • Figure 2: The deep ensemble method for performance anomaly detection.
  • Figure 3: The root cause localization pipeline.
  • Figure 4: Visualization of the root cause localization pipeline.
  • Figure 5: Monitoring a Hyperledger Fabric-based decentralized application and continuously collecting data with Prometheus.
  • ...and 7 more figures