Modeling Anomaly Detection in Cloud Services: Analysis of the Properties that Impact Latency and Resource Consumption

Gabriel Job Antunes Grabher; Fumio Machida; Thomas Ropars

TL;DR

This work addresses how performance anomaly detectors influence latency and resource use in Cloud services. It introduces a Stochastic Reward Net (SRN) model that couples a Cloud service with an anomaly detector, translating detector precision, recall, and inspection frequency into SRN transitions and rewards to estimate latency and replica costs. Through numerical evaluation of detector flavors (e.g., Superior, GreatPrec, GreatRec, GoodPrec, GoodRec, Heuristic, Random) across varying inspection intervals, the study finds that maximizing both precision and recall is not always necessary; for frequent inspections, prioritizing precision yields near-optimal latency-cost trade-offs, while for infrequent inspections, emphasis on recall is more beneficial. The results offer practical guidance for tuning detectors and demonstrate the utility of SRN-based analysis to evaluate performance-cost implications in dynamic Cloud environments, with potential extensions to multi-service scenarios and alternative corrective actions.

Abstract

Detecting and resolving performance anomalies in Cloud services is crucial for maintaining desired performance objectives. Scaling actions triggered by an anomaly detector help achieve target latency at the cost of extra resource consumption. However, performance anomaly detectors make mistakes. This paper studies which characteristics of performance anomaly detection matter most for optimizing the trade-off between performance and cost. Using Stochastic Reward Nets, we model a Cloud service monitored by a performance anomaly detector. With this model, we study the impact of three detector characteristics, namely precision, recall, and inspection frequency, on the average latency and resource consumption of the monitored service. Our results show that achieving both high precision and high recall is not always necessary. If detection can be run frequently, high precision is enough to obtain a good performance-to-cost trade-off, but if the detector is run infrequently, recall becomes the most important characteristic.
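
To make the two kinds of detector mistakes concrete, the following minimal sketch uses only the standard definitions of precision and recall to estimate how many anomalies a detector would miss (which leaves latency degraded) and how many false alarms it would raise (which triggers unnecessary scaling). It is not the paper's SRN model; the function name, its parameters, and the example numbers are hypothetical and chosen purely for illustration.

```python
def expected_detector_outcomes(true_anomalies, precision, recall):
    """Expected detection outcomes from the standard definitions:
    precision = TP / (TP + FP), recall = TP / (TP + FN).

    Illustrative helper only (hypothetical name and signature);
    it does not reproduce the SRN model described in the abstract.
    """
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    if not 0 <= recall <= 1:
        raise ValueError("recall must be in [0, 1]")

    # Anomalies correctly flagged by the detector.
    true_positives = recall * true_anomalies
    # Anomalies missed: the service keeps running with degraded latency.
    false_negatives = true_anomalies - true_positives
    # False alarms: scaling is triggered without need, adding replica cost.
    # Derived by solving precision = TP / (TP + FP) for FP.
    false_positives = true_positives * (1 - precision) / precision

    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 100 real anomalies, two contrasting detectors.
    print(expected_detector_outcomes(100, precision=0.95, recall=0.60))
    print(expected_detector_outcomes(100, precision=0.60, recall=0.95))
```

In this toy view, a high-precision detector raises few false alarms but misses more anomalies, while a high-recall detector does the opposite. How often the detector runs, which this sketch ignores, is what the paper's model adds to decide which of the two matters more.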
