High Significant Fault Detection in Azure Core Workload Insights
Pranay Lohia, Laurent Boue, Sharath Rangappa, Vijay Agneeswaran
TL;DR
This work addresses the challenge of surfacing highly perceptible faults in Azure Core workload insights while curbing alert fatigue. It introduces MARIO, an AIOps-powered system that employs a two-stage anomaly detection pipeline on time-series data, combining WaveNet-based forecasting (Enhanced ADaaS) with Extreme Value Theory to isolate a small set of high-signal anomalies. The approach achieves high true-positive rates and low false positives, validated on both internal workloads and public benchmark datasets, and demonstrates superior identification of significant anomalies compared to state-of-the-art methods like TFT and DeepAR. Practically, it offers a production-ready, auditable anomaly detection framework with confidence scores, human-in-the-loop validation, and a clear path to scale across thousands of time-series in Azure core workloads.
Abstract
Azure Core workload insights consist of time-series data reported in different metric units. Faults, or anomalies, appear in these time series and are tied to the metric name, resource region, dimensions, and dimension values associated with the data. For Azure Core, an important task is to surface these faults on a dashboard in a form that users can readily perceive. The reported anomalies should be few and highly significant, e.g., 5-20 per hour. Such anomalies are clearly perceptible to users and produce a high reconstruction error under any time-series forecasting model. Our task is therefore to automatically identify these 'high significant anomalies' and their associated information for presentation to the user.
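To make the idea of "few, highly significant anomalies with high reconstruction error" concrete, the following is a minimal sketch and not the MARIO implementation. It assumes a rolling-median forecaster as a stand-in for the WaveNet-based model, a peaks-over-threshold fit of a generalized Pareto distribution by method of moments as one simple form of Extreme Value Theory thresholding, and a hard cap on the number of reports. All function names and parameters (`forecast_errors`, `pot_threshold`, `significant_anomalies`, `init_quantile`, `risk`, `max_reports`) are hypothetical.

```python
import math
import random
import statistics


def forecast_errors(series, window=12):
    """Absolute error of a rolling-median forecast (stand-in for a learned model)."""
    errors = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        predicted = statistics.median(history) if history else value
        errors.append(abs(value - predicted))
    return errors


def pot_threshold(errors, init_quantile=0.98, risk=1e-4):
    """Peaks-over-threshold: fit a GPD to the tail of `errors`, return the
    error level expected to be exceeded with probability `risk`."""
    n = len(errors)
    t = sorted(errors)[int(init_quantile * n)]       # initial high threshold
    excesses = [e - t for e in errors if e > t]
    if len(excesses) < 2:
        return t
    mean = statistics.fmean(excesses)
    var = statistics.variance(excesses)
    if var == 0:
        return t
    ratio = mean * mean / var
    xi = 0.5 * (1.0 - ratio)                         # GPD shape (method of moments)
    sigma = 0.5 * mean * (ratio + 1.0)               # GPD scale (method of moments)
    r = risk * n / len(excesses)
    if abs(xi) < 1e-9:                               # exponential-tail limit
        return t - sigma * math.log(r)
    return t + (sigma / xi) * (r ** (-xi) - 1.0)


def significant_anomalies(series, threshold, max_reports=20):
    """Return at most `max_reports` (index, error) pairs above `threshold`,
    largest error first."""
    errors = forecast_errors(series)
    flagged = [(i, e) for i, e in enumerate(errors) if e > threshold]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return flagged[:max_reports]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed synthetic data: a clean calibration history and a live window
    # with two injected faults.
    history = [100 + rng.gauss(0, 1) for _ in range(2000)]
    live = [100 + rng.gauss(0, 1) for _ in range(300)]
    live[100] += 50
    live[200] += 80
    z = pot_threshold(forecast_errors(history))
    for index, error in significant_anomalies(live, z):
        print(f"index={index} error={error:.1f} threshold={z:.2f}")
```

In this sketch the threshold is calibrated on fault-free history so that large faults do not inflate the tail fit, and the `max_reports` cap plays the role of the 5-20 anomalies per hour budget.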
