High Significant Fault Detection in Azure Core Workload Insights

Pranay Lohia; Laurent Boue; Sharath Rangappa; Vijay Agneeswaran

TL;DR

This work addresses the challenge of surfacing highly perceptible faults in Azure Core workload insights while curbing alert fatigue. It introduces MARIO, an AIOps-powered system that employs a two-stage anomaly detection pipeline on time-series data, combining WaveNet-based forecasting (Enhanced ADaaS) with Extreme Value Theory to isolate a small set of high-signal anomalies. The approach achieves high true-positive rates and low false positives, validated on both internal workloads and public benchmark datasets, and demonstrates superior identification of significant anomalies compared to state-of-the-art methods like TFT and DeepAR. Practically, it offers a production-ready, auditable anomaly detection framework with confidence scores, human-in-the-loop validation, and a clear path to scale across thousands of time-series in Azure core workloads.

Abstract

Azure Core workload insights consist of time-series data reported in different metric units. Faults, or anomalies, appear in these time series and are tied to the metric name, resource region, dimensions, and dimension values associated with the data. For Azure Core, an important task is to highlight these faults on a dashboard in a form users can perceive easily. The anomalies reported should be highly significant and limited in number, e.g., 5-20 anomalies per hour. Such anomalies are readily perceptible to users and show a high reconstruction error under any time-series forecasting model. Hence, our task is to automatically identify 'high significant anomalies' and their associated information and surface them to users.
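The idea of keeping only a few anomalies with high reconstruction error can be illustrated with a peaks-over-threshold step from Extreme Value Theory. The sketch below is not the paper's implementation: it assumes the forecasts come from any forecasting model (here just a list of numbers), it fits the Generalized Pareto tail with a simple method-of-moments estimator chosen for brevity, and the function names and the `init_quantile` and `risk` parameters are hypothetical.

```python
import math
import random
import statistics


def fit_gpd_moments(excesses):
    """Method-of-moments fit of a Generalized Pareto distribution.

    Returns (xi, sigma): the shape and scale of the tail. This estimator is
    an assumption made for brevity; it is only reliable for xi < 0.5.
    """
    mean = statistics.fmean(excesses)
    var = statistics.variance(excesses)
    if var == 0:
        raise ValueError("excesses have zero variance; cannot fit a tail")
    ratio = mean * mean / var
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (ratio + 1.0)
    return xi, sigma


def evt_threshold(errors, init_quantile=0.98, risk=1e-3):
    """Peaks-over-threshold: error level exceeded with probability ~risk."""
    n = len(errors)
    ordered = sorted(errors)
    # Initial high threshold t taken as an empirical quantile of the errors.
    t = ordered[min(int(init_quantile * n), n - 1)]
    excesses = [e - t for e in errors if e > t]
    if len(excesses) < 2:
        raise ValueError("not enough peaks above the initial threshold")
    xi, sigma = fit_gpd_moments(excesses)
    r = risk * n / len(excesses)
    if abs(xi) < 1e-9:
        # Limit of the formula below as xi -> 0 (exponential tail).
        return t - sigma * math.log(r)
    return t + (sigma / xi) * (r ** (-xi) - 1.0)


def flag_significant(actual, forecast, init_quantile=0.98, risk=1e-3):
    """Return (indices whose absolute forecast error exceeds the EVT level, level)."""
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    z = evt_threshold(errors, init_quantile, risk)
    return [i for i, e in enumerate(errors) if e > z], z


if __name__ == "__main__":
    # Synthetic stand-in data: unit Gaussian noise around a flat forecast,
    # with three large injected spikes.
    random.seed(0)
    forecast = [0.0] * 5000
    actual = [random.gauss(0.0, 1.0) for _ in range(5000)]
    for i in (100, 2500, 4000):
        actual[i] += 25.0
    indices, level = flag_significant(actual, forecast)
    print(level, indices)
```

Lowering `risk` raises the threshold and shrinks the reported set, which is one way to keep the number of surfaced anomalies within a small budget such as the 5-20 per hour mentioned above.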

Paper Structure (11 sections, 1 equation, 4 figures, 4 tables)

This paper contains 11 sections, 1 equation, 4 figures, 4 tables.

Introduction
Related Work
System: Maintainability Availability Reliability Intelligence Ops (MARIO)
System Details and Data Overview
Challenges and Opportunities with ML for Azure Core World
Data Structure Overview
Solution Overview
Result
Generalisation of Method
Comparison with State-of-Art
Conclusion

Figures (4)

Figure 1: Time-series plot showcasing high and low significant anomalies
Figure 2: Enhanced ADaaS + EVT method flow diagram
Figure 3: Plateaued performance plot of TFT and DeepAR with varying quantiles
Figure 4: Growth performance plot of EVT with varying quantiles
