FC-ADL: Efficient Microservice Anomaly Detection and Localisation Through Functional Connectivity
Giles Winchester, George Parisis, Luc Berthouze
TL;DR
FC-ADL is a scalable framework for detecting and localising microservice anomalies by learning time-varying functional connectivity (FC) from CPU metrics rather than relying on traces. It computes time-evolving FC graphs, measures structural changes between them with DeltaCon, clusters change events with HDBSCAN, and localises root causes via a random-walk analysis on the changed dependencies. The approach achieves state-of-the-art detection and root cause analysis (RCA) performance across diverse fault scenarios and scales to large deployments such as Alibaba's, with substantially lower data and computational overhead than causal approaches. These results demonstrate the practical viability of association-based, end-to-end anomaly detection and RCA in large-scale microservice environments.
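The first two stages of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes FC is estimated as absolute Pearson correlation over non-overlapping windows, that weak edges are dropped with a hypothetical threshold of 0.5, and that graph similarity is computed with the dense DeltaCon0 formulation (fast belief propagation affinities compared by root Euclidean distance). The synthetic data, window length, and function names are illustrative only; the HDBSCAN clustering and random-walk localisation stages are omitted.

```python
import numpy as np


def fc_graphs(cpu, window, step, threshold=0.5):
    """Build one weighted FC graph per sliding window.

    cpu: array of shape (time, services) holding CPU usage.
    Edges are absolute Pearson correlations; edges below `threshold`
    are removed (an assumption of this sketch, not taken from the paper).
    """
    graphs = []
    for start in range(0, cpu.shape[0] - window + 1, step):
        corr = np.corrcoef(cpu[start:start + window], rowvar=False)
        adj = np.abs(np.nan_to_num(corr))
        adj[adj < threshold] = 0.0
        np.fill_diagonal(adj, 0.0)
        graphs.append(adj)
    return graphs


def fabp_affinity(adj):
    """Fast belief propagation affinity matrix used by DeltaCon0."""
    degree = adj.sum(axis=1)
    eps = 1.0 / (1.0 + degree.max())
    n = adj.shape[0]
    return np.linalg.inv(np.eye(n) + eps ** 2 * np.diag(degree) - eps * adj)


def deltacon_similarity(adj_a, adj_b):
    """DeltaCon0 similarity in (0, 1]; 1 means identical graphs."""
    s_a = np.clip(fabp_affinity(adj_a), 0.0, None)
    s_b = np.clip(fabp_affinity(adj_b), 0.0, None)
    root_ed = np.sqrt(np.sum((np.sqrt(s_a) - np.sqrt(s_b)) ** 2))
    return 1.0 / (1.0 + root_ed)


# Synthetic CPU metrics for five services. Services 0-2 follow a shared
# load signal for the first 300 samples, then become independent,
# mimicking a fault that breaks their dependency.
rng = np.random.default_rng(0)
cpu = rng.normal(size=(600, 5))
shared_load = rng.normal(size=300)
for service in range(3):
    cpu[:300, service] = shared_load + 0.1 * rng.normal(size=300)

graphs = fc_graphs(cpu, window=100, step=100)
similarities = [
    deltacon_similarity(graphs[i], graphs[i + 1])
    for i in range(len(graphs) - 1)
]

# The transition with the lowest similarity marks the candidate change point.
change_index = int(np.argmin(similarities))
print("similarities:", np.round(similarities, 3))
print("largest structural change between windows",
      change_index, "and", change_index + 1)
```

The dense matrix inverse costs cubic time in the number of services, so this form only suits small graphs; a large deployment would need the approximate, scalable variant of DeltaCon.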
Abstract
Microservices have transformed software architecture through the creation of modular and independent services. However, they introduce operational complexities in service integration and system management that make swift and accurate anomaly detection and localisation challenging. Despite the complex, dynamic, and interconnected nature of microservice architectures, prior works that investigate metrics for anomaly detection rarely include explicit information about time-varying interdependencies. Whilst prior works on fault localisation typically do incorporate information about dependencies between microservices, they scale poorly to real-world large-scale deployments due to their reliance on computationally expensive causal inference. To address these challenges we propose FC-ADL, an end-to-end scalable approach for detecting and localising anomalous changes from microservice metrics based on the neuroscientific concept of functional connectivity. We show that by efficiently characterising time-varying changes in dependencies between microservice metrics we can both detect anomalies and provide root cause candidates without incurring the significant overheads of causal and multivariate approaches. We demonstrate that our approach achieves top detection and localisation performance across a wide range of fault scenarios when compared to state-of-the-art approaches. Furthermore, we illustrate the scalability of our approach by applying it to Alibaba's extremely large real-world microservice deployment.
