FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications

Jin Huang, Pengfei Chen, Guangba Yu, Yilun Wang, Haiyu Huang, Zilong He

TL;DR

This work tackles root-cause analysis for serverless applications across the full lifecycle, addressing limitations of microservice RCA methods in handling transient, pulse-like data. It introduces FaaSRCA, which constructs a Global Call Graph that fuses multi-modal observability data from both platform and application sides and trains a Graph Attention Network (GAT) based graph auto-encoder to output reconstruction scores for nodes. Root causes are located by comparing node scores against normal patterns using a z-score-based deviation, enabling lifecycle-stage level localization; experiments on two serverless benchmarks show top-k precision improvements of 21.25% to 81.63% over baselines, with an average HR@k of 91.54% and NDCG@k of 94.62%. The approach achieves efficient inference (~8 ms per graph) and provides a practical, unsupervised RCA solution for improving serverless reliability in real-world deployments.
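
The HR@k and NDCG@k figures quoted above are ranking metrics. As a minimal sketch, assuming the common definitions with exactly one true root cause per fault case (the paper's exact formulas are not reproduced here), they could be computed as follows. The function names and the toy rankings are hypothetical.

```python
import math

def hit_rate_at_k(rankings, truths, k):
    """Fraction of fault cases whose true root cause appears in the top-k candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)

def ndcg_at_k(rankings, truths, k):
    """NDCG@k with a single relevant item per case, so the ideal DCG is 1."""
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        top = ranked[:k]
        if truth in top:
            # Position 0 contributes 1/log2(2) = 1; later positions are discounted.
            total += 1.0 / math.log2(top.index(truth) + 2)
    return total / len(truths)

# Hypothetical example: two fault cases, both with true root cause "a".
rankings = [["a", "b", "c"], ["b", "a", "c"]]
truths = ["a", "a"]
hr1 = hit_rate_at_k(rankings, truths, 1)
hr2 = hit_rate_at_k(rankings, truths, 2)
ndcg2 = ndcg_at_k(rankings, truths, 2)
```

Under these assumed definitions, HR@k only records whether the root cause is in the top k, while NDCG@k also rewards placing it nearer the top.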

Abstract

Serverless has become popular as a novel computing paradigm for cloud-native services. However, the complexity and dynamic nature of serverless applications make it challenging to ensure system availability and performance. Many root cause analysis (RCA) methods exist for microservice systems, but they are not suitable for precisely modeling serverless applications, for two reasons: (1) Compared to microservices, serverless applications are highly dynamic. They have short lifecycles and generate only instantaneous, pulse-like data, lacking long-term continuous information. (2) Existing methods focus solely on the running stage and overlook other stages, failing to cover the entire lifecycle of serverless applications. To address these limitations, we propose FaaSRCA, a full lifecycle root cause analysis method for serverless applications. It integrates multi-modal observability data generated on the platform and application sides using a Global Call Graph. We train a Graph Attention Network (GAT) based graph auto-encoder to compute reconstruction scores for the nodes in the global call graph. Based on the scores, we determine the root cause at the granularity of the lifecycle stage of serverless functions. We conduct experimental evaluations on two serverless benchmarks; the results show that FaaSRCA outperforms baseline methods, with a top-k precision improvement ranging from 21.25% to 81.63%.
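
The TL;DR describes the localization step as comparing node reconstruction scores against normal patterns using a z-score-based deviation. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each graph node already has a list of reconstruction scores from fault-free runs plus one score from the run under analysis. The function name, the node labels (function plus lifecycle stage), and all numbers are hypothetical.

```python
from statistics import mean, stdev

def zscore_rank(normal_scores, observed_scores, eps=1e-8):
    """Rank nodes by how far the observed score deviates from fault-free runs."""
    deviations = {}
    for node, observed in observed_scores.items():
        history = normal_scores[node]
        mu = mean(history)
        # A sample standard deviation needs at least two points; fall back to 0.
        sigma = stdev(history) if len(history) > 1 else 0.0
        # eps avoids division by zero when the history is constant.
        deviations[node] = abs(observed - mu) / (sigma + eps)
    # Largest deviation first: the top entry is the most suspicious node.
    return sorted(deviations.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical reconstruction scores collected during normal operation.
normal = {
    "f1:cold_start": [0.10, 0.12, 0.11, 0.09],
    "f1:running": [0.20, 0.22, 0.21, 0.19],
    "f2:running": [0.30, 0.31, 0.29, 0.30],
}
# Hypothetical scores for the graph under analysis.
observed = {"f1:cold_start": 0.95, "f1:running": 0.21, "f2:running": 0.31}

ranking = zscore_rank(normal, observed)
suspect = ranking[0][0]
```

Normalizing by each node's own mean and spread, rather than thresholding the raw score, lets nodes with naturally different score levels be compared on one scale.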

Paper Structure

This paper contains 27 sections, 11 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: (a) Function execution time in Serverless TrainTicket. (b) Function execution time in Microsoft Azure.
  • Figure 2: CPU usage variation in serverless functions’ lifecycle.
  • Figure 3: Evaluation results of Eadro on two serverless datasets.
  • Figure 4: The lifecycle stages of serverless functions.
  • Figure 5: Overall architecture of FaaSRCA.
  • ...and 5 more figures