How Industry Tackles Anomalies during Runtime: Approaches and Key Monitoring Parameters
Monika Steidl, Benedikt Dornauer, Michael Felderer, Rudolf Ramler, Mircea-Cristian Racasan, Marko Gattringer
TL;DR
This paper investigates how industry handles runtime anomalies in complex, microservice-based systems by combining an extended literature review with semi-structured interviews across multiple domains. It reveals a strong industry preference for rule-based anomaly detection: AI-based approaches are gaining attention in recent literature but see limited production adoption because of data quality issues and false positives. The study identifies core runtime monitoring parameters drawn from logs, traces, and metrics (including energy consumption as a prospective signal) and discusses how parameter relationships and monitoring strategies influence the effectiveness of anomaly detection. These findings motivate building explainable models that capture inter-parameter dependencies and causality, validated on benchmarks and real-world data, to make anomaly detection more robust in practice.
Abstract
Deviations from expected behavior during runtime, known as anomalies, have become more common as systems grow more complex, especially in microservice architectures. Consequently, analyzing runtime monitoring data, such as logs, traces for microservices, and metrics, is challenging because of the large volume of data collected. Developing effective rules or AI algorithms requires a deep understanding of this data to reliably detect unforeseen anomalies. This paper seeks to understand anomalies and current anomaly detection approaches across diverse industrial sectors, and to pinpoint the parameters necessary for identifying anomalies via runtime monitoring data. To this end, we conducted semi-structured interviews with fifteen industry participants who rely on anomaly detection during runtime. To supplement the interviews, we performed a literature review focusing on anomaly detection approaches applied to industrial real-life datasets. Our paper (1) demonstrates the diversity of interpretations and examples of software anomalies during runtime and (2) explores the reasons behind choosing rule-based approaches in industry over self-developed AI approaches. By contrast, AI-based approaches have become prominent in published industry-related papers over the last three years. Furthermore, we (3) identify key monitoring parameters collected during runtime (logs, traces, and metrics) that help practitioners detect anomalies without biasing their anomaly detection approach through inconclusive parameters.
