Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis

Rohan Kumar; Jason Li; Zongshun Zhang; Syed Mohammad Qasim; Gianluca Stringhini; Ayse Kivilcim Coskun

Abstract

As the microservice architecture for cloud applications grows in popularity, cloud services are becoming more complex and more vulnerable to misconfiguration and software bugs. Traditional approaches rely on expert input to diagnose and fix microservice anomalies, which does not scale under the continuous integration and continuous deployment (CI/CD) paradigm. Microservice rollouts, which contain new software installations, interact in complex ways with the components of an application. This makes it harder to attribute anomalous behavior to any specific installation or rollout, and can slow resolution. To address these gaps in current diagnostic methods, this paper introduces Praxium, a framework for anomaly detection and root cause inference. Praxium aids administrators in evaluating target metric performance in the context of dependency installation information provided by a software discovery tool, PraxiPaaS. Praxium continuously monitors telemetry data to identify anomalies, then conducts root cause analysis by applying causal impact analysis to recent software installations, in order to give site reliability engineers (SREs) relevant information about an observed anomaly. In this paper, we demonstrate that Praxium is capable of effective anomaly detection and root cause inference, and we analyze how to tune anomaly detection hyperparameters in a practical setting. Across 75 total trials using four synthetic anomalies, anomaly detection consistently achieves >0.97 macro-F1. In addition, we show that causal impact analysis reliably infers the correct root cause of anomalies, even as package installations occur at increasingly shorter intervals.

Paper Structure (18 sections, 5 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Causality Graphs
Anomaly Detection
Machine Learning Based Root Cause Analysis
Software Bill of Materials
Vision for Praxium
Praxium
Software Dependency Logging System
Anomaly Detection and Triggering System
Root Cause Analysis and Causal Graph
Evaluations
Experimental Methodology
Anomaly Detection: Hyperparameter Tradeoffs
CausalImpact Analysis
...and 3 more sections

Figures (5)

Figure 1: An overview of Praxium. The system monitors a microservice cluster, collecting telemetry, trace, and installation data via Prometheus, Jaeger, and PraxiPaaS. Install logs are stored in MongoDB, while telemetry is streamed through anomaly detection. When an anomaly is detected, trace data populates the causal graph generator, and critical path logs along with target pod metrics pass to root cause analysis. From there, Praxium attributes the anomaly to a single installation.
Figure 2: An example of the software dependency logging system. A background process periodically scans for software changes that occurred within the past $T$ time units. For each service, it creates an SBOM log history in a persistent database. In the figure, there are $3$ rollouts for Service $A$, and each of them updates the $libcurl$ library. The first run finds no changes to Service $A$ in the past $T$ time units, so no entry is created in the database. The second run sees two rollouts, so it creates an entry with the timestamp ($Ts_{0}$) as the key, where the value is a list of sets containing the software changes in deployment order. Note that the first set includes all software of the service, but the second set includes only the changed software. The final run sees the last rollout and creates a corresponding entry in the database.
Figure 3: An example of the hyperparameters of experiment 1. Stride $s$ is the time shift between consecutive windows. Window size $w$ is the duration of each window. Threshold $T$ is the number of consecutive anomalous windows required before the system raises an alarm about an anomaly.
Figure 4: An example of experiment 2, in two cases. Case 1 is trivial: a single, easily identified rollout correlates with the shift in the target metric. In case 2, rollouts in quick succession crowd around the shift in the metric, obscuring which rollout is best correlated.
Figure 5: An example of the Social Network service dependency graph during the ComposePost functionality. Here, an anomaly is injected into home-timeline-service while home-timeline-service, social-graph-service, and text-service are redeployed with new installations. Only installations along the critical path (in blue) are considered for causal inference.
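
The three hyperparameters in the Figure 3 caption (stride $s$, window size $w$, threshold $T$) can be illustrated with a minimal sketch. This is not Praxium's implementation: the per-window detector `is_anomalous`, the helper `mean_above`, and the sample latency values are hypothetical stand-ins, and the sketch measures $w$ and $s$ in samples rather than in time units.

```python
from typing import Callable, Iterator, Optional, Sequence, Tuple


def sliding_windows(
    samples: Sequence[float], w: int, s: int
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start_index, window) pairs of size w, shifted by stride s."""
    for start in range(0, len(samples) - w + 1, s):
        yield start, samples[start:start + w]


def first_alarm(
    samples: Sequence[float],
    w: int,
    s: int,
    T: int,
    is_anomalous: Callable[[Sequence[float]], bool],
) -> Optional[int]:
    """Return the start index of the window that completes a run of T
    consecutive anomalous windows, or None if no such run occurs."""
    consecutive = 0
    for start, window in sliding_windows(samples, w, s):
        # Any normal window resets the run of anomalous windows.
        consecutive = consecutive + 1 if is_anomalous(window) else 0
        if consecutive >= T:
            return start
    return None


def mean_above(limit: float) -> Callable[[Sequence[float]], bool]:
    """Hypothetical detector: a window is anomalous if its mean exceeds limit."""
    return lambda window: sum(window) / len(window) > limit


# Hypothetical latency samples with a shift halfway through.
latency = [10, 11, 10, 12, 10, 11, 50, 52, 51, 53, 50, 52]
alarm_at = first_alarm(latency, w=4, s=2, T=2, is_anomalous=mean_above(30))
```

In this sketch, a larger $T$ delays the alarm but filters out isolated anomalous windows, while a smaller $s$ produces more overlapping windows and therefore more frequent checks.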