
A Holistic Framework for Automated Configuration Recommendation for Cloud Service Monitoring

Anson Bastos, Shreeya Venneti, Anjaly Parayil, Ayush Choure, Chetan Bansal, Rujia Wang

Abstract

Reliability of large-scale cloud services is critical for user satisfaction and business continuity. Despite significant investments in reliability engineering, production incidents remain inevitable, often leading to customer impact and operational overhead. In large cloud companies, multiple services are deployed across regions, necessitating robust health monitoring systems. However, the current monitor configuration process is manual, largely reactive, and ad hoc, resulting in gaps in coverage and redundant alerts. In this paper, we present a comprehensive study of monitor creation at Microsoft, identifying key components in the existing process. We further design a modular recommendation framework that processes the graph-structured service entities to suggest optimal monitor configurations. Through extensive experimentation on historical data and a user study of recommendations for production services at Microsoft, we demonstrate the efficacy of our approach in providing relevant recommendations for monitor configurations.

Paper Structure (20 sections, 6 equations, 11 figures, 2 tables)

This paper contains 20 sections, 6 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: System Overview: a) Monitor entity graph: nodes represent the monitors, metrics, and dimensions in a cloud setting. Each node contains text features and interacts with its neighboring nodes using link communication. b) The overall service health, where the cloud service emits metrics along many dimensions and only a few are used for monitoring health. For example, here the service health depends on the health of each datacenter using the service, which in turn depends on the health of the individual VMs in the datacenter. The alerting conditions are applied over the defined expressions on metrics and are decided by the business need.
  • Figure 2: Characteristics of the Monitor Entity Graph: a) Distribution of degree associated with dimensions based on the metric-to-dimension links, b) Distribution of the percentage of dimensions selected from the set of all dimensions along which the metric is emitted.
  • Figure 3: a) Variation in Jaccard similarity of the sets of dimensions associated with monitors with similar metric names, similar monitor names, and the same service account, and b) Distribution of pairwise correlation between dimensions.
  • Figure 4: Categorization of the expressions used by the monitors. Most expressions (83%) use one of count, sum, or average as the mathematical form to aggregate the metrics data.
  • Figure 5: Analysis of the importance of time series features for predicting the monitoring status of the metric.
  • ...and 6 more figures
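
The Figure 3 caption refers to the Jaccard similarity between the sets of dimensions attached to two monitors. As a minimal sketch of that measure, the Python snippet below computes |A ∩ B| / |A ∪ B| for two dimension sets. The function name `jaccard`, the example dimension names, and the convention of returning 1.0 for two empty sets are assumptions made for illustration and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are considered identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical dimension sets for two monitors on the same metric.
monitor_a = {"Region", "Datacenter", "VM"}
monitor_b = {"Region", "Datacenter", "StatusCode"}

similarity = jaccard(monitor_a, monitor_b)
print(similarity)
```

A value near 1 means the two monitors slice the metric along almost the same dimensions, while a value near 0 means their dimension sets barely overlap.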

Theorems & Definitions (2)

  • Definition 1
  • Definition 2: Problem formulation