FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications
Jin Huang, Pengfei Chen, Guangba Yu, Yilun Wang, Haiyu Huang, Zilong He
TL;DR
This work tackles root cause analysis (RCA) for serverless applications across their full lifecycle, addressing the limitations of microservice RCA methods in handling transient, pulse-like data. It introduces FaaSRCA, which constructs a Global Call Graph that fuses multi-modal observability data from both the platform and application sides, and trains a Graph Attention Network (GAT) based graph auto-encoder to output reconstruction scores for the graph's nodes. Root causes are located by comparing node scores against normal patterns using a z-score-based deviation, enabling localization at the granularity of lifecycle stages. Experiments on two serverless benchmarks show top-k precision improvements of 21.25% to 81.63% over baselines, with an average HR@k of 91.54% and NDCG@k of 94.62%. The approach achieves efficient inference (~8 ms per graph) and provides a practical, unsupervised RCA solution for improving serverless reliability in real-world deployments.
Abstract
Serverless computing has become popular as a novel computing paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications present significant challenges to ensuring system availability and performance. There are many root cause analysis (RCA) methods for microservice systems, but they are not suitable for precisely modeling serverless applications, for two reasons: (1) Compared to microservices, serverless applications are highly dynamic. They have short lifecycles and generate only instantaneous, pulse-like data, lacking long-term continuous information. (2) Existing methods focus solely on the running stage and overlook other stages, failing to cover the entire lifecycle of serverless applications. To address these limitations, we propose FaaSRCA, a full lifecycle root cause analysis method for serverless applications. It integrates multi-modal observability data generated on the platform and application sides using a Global Call Graph. We train a Graph Attention Network (GAT) based graph auto-encoder to compute reconstruction scores for the nodes in the Global Call Graph. Based on these scores, we determine the root cause at the granularity of the lifecycle stage of serverless functions. We conduct experimental evaluations on two serverless benchmarks; the results show that FaaSRCA outperforms baseline methods, with a top-k precision improvement ranging from 21.25% to 81.63%.
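
As an illustration of the final scoring step, the following is a minimal sketch of ranking Global Call Graph nodes by how far their reconstruction scores deviate from scores seen in fault-free runs, using a z-score. It is not the authors' implementation: the function name `rank_by_zscore`, the node names, the score values, and the use of the absolute z-score with a population standard deviation are all assumptions made for this example. The GAT-based auto-encoder that would produce the scores is not shown.

```python
from statistics import mean, pstdev


def rank_by_zscore(normal_scores, observed_scores, eps=1e-8):
    """Rank nodes by the z-score deviation of their reconstruction score.

    normal_scores:   dict mapping node -> list of scores from fault-free runs
    observed_scores: dict mapping node -> score in the graph under analysis
    Returns a list of (node, z) pairs, most anomalous first.
    """
    ranked = []
    for node, score in observed_scores.items():
        history = normal_scores[node]
        mu = mean(history)
        sigma = pstdev(history)  # population standard deviation (assumption)
        # eps avoids division by zero when a node's normal scores are constant
        z = abs(score - mu) / (sigma + eps)
        ranked.append((node, z))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical nodes: one per (function, lifecycle stage) pair.
normal = {
    "resize/cold_start": [0.10, 0.12, 0.11, 0.09],
    "resize/execution": [0.20, 0.22, 0.21, 0.19],
    "upload/execution": [0.30, 0.31, 0.29, 0.30],
}
observed = {
    "resize/cold_start": 0.45,  # far outside its normal range
    "resize/execution": 0.23,
    "upload/execution": 0.30,
}

ranking = rank_by_zscore(normal, observed)
top_node = ranking[0][0]
```

In this toy input, the cold-start stage of the hypothetical `resize` function deviates most from its normal pattern, so it is ranked first as the root cause candidate. Taking the top k entries of `ranking` would give the candidate list that top-k metrics such as HR@k evaluate.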
