Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight

Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, Jonathan Mace

TL;DR

Cloud systems generate diverse telemetry and exhibit cascading symptoms, making automatic root-cause localization challenging. Atlas uses large language models to synthesize causal graphs from system documentation, traces, and deployment feedback: it assigns an agent to each system component to discover local causal relationships, then composes these into a global graph. A novel confounder-graph construction and a data-driven Pareto refinement step validate and correct the result, yielding high-quality graphs that support effective fault localization, often matching ground-truth performance and outperforming traditional data-driven causal discovery. The approach scales to complex cloud environments, is robust to missing metrics, and supports rapid diagnosis and triage in production settings; an open-source simulator and datasets are provided to aid broader adoption.

Abstract

Runtime failures and performance degradation are commonplace in modern cloud systems. For cloud providers, automatically determining the root cause of incidents is paramount to ensuring high reliability and availability, as prompt fault localization can enable faster diagnosis and triage for timely resolution. A compelling solution explored in recent work is causal reasoning using causal graphs to capture relationships between varied cloud system performance metrics. To be effective, however, systems developers must correctly define the causal graph of their system, which is a time-consuming, brittle, and challenging task that requires domain expertise and increases in difficulty for large and dynamic systems. Alternatively, automated data-driven approaches have limited efficacy for cloud systems due to the inherent rarity of incidents. In this work, we present Atlas, a novel approach to automatically synthesizing causal graphs for cloud systems. Atlas leverages large language models (LLMs) to generate causal graphs using system documentation, telemetry, and deployment feedback. Atlas is complementary to data-driven causal discovery techniques, and we further enhance Atlas with a data-driven validation step. We evaluate Atlas across a range of fault localization scenarios and demonstrate that Atlas is capable of generating causal graphs in a scalable and generalizable manner, with performance that far surpasses that of data-driven algorithms and is commensurate with the ground-truth baseline.
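
To make the idea of a causal graph over performance metrics concrete, the following is a minimal sketch, not Atlas's actual implementation. It assumes that local causal edges have already been proposed per component (in Atlas this is the job of LLM agents, which are not modeled here), merges them into one global graph, and then walks upstream from a symptomatic metric to find anomalous metrics that have no anomalous cause of their own. All metric names, component names, and function names are hypothetical and chosen only for illustration; the confounder-graph construction and Pareto refinement steps are omitted.

```python
from collections import defaultdict

def compose_global_graph(local_edges_by_component):
    """Merge per-component (cause, effect) edge lists into a map: metric -> set of parent metrics."""
    parents = defaultdict(set)
    for edges in local_edges_by_component.values():
        for cause, effect in edges:
            parents[effect].add(cause)
            parents[cause]  # touch the key so root metrics also appear in the graph
    return dict(parents)

def localize_root_causes(parents, anomalous, symptom):
    """Walk upstream from the symptom through anomalous metrics only.

    Returns the anomalous metrics reached that have no anomalous parent,
    i.e. the candidate root causes under this simple heuristic.
    """
    roots, seen, stack = set(), set(), [symptom]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        bad_parents = [p for p in parents.get(node, ()) if p in anomalous]
        if not bad_parents and node in anomalous:
            roots.add(node)
        stack.extend(bad_parents)
    return roots

# Hypothetical local edges, one list per component (illustrative names only).
local_edges = {
    "gpu": [("gpu.utilization", "gpu.temperature")],
    "model_server": [
        ("gpu.utilization", "server.latency"),
        ("server.queue_length", "server.latency"),
    ],
    "client": [("server.latency", "client.timeout_rate")],
}

graph = compose_global_graph(local_edges)
anomalous = {"client.timeout_rate", "server.latency", "server.queue_length"}
roots = localize_root_causes(graph, anomalous, "client.timeout_rate")
print(roots)
```

In this toy scenario the client timeout rate is the observed symptom; the walk passes through the anomalous server latency and stops at the queue length, since GPU utilization is not anomalous. A real deployment would need a principled anomaly detector and a validated graph, which is where the refinement steps described above come in.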

Paper Structure

This paper contains 22 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of Atlas: (1) Instantiating agents for each system component by parsing system documentation and telemetry; (2) Enumerating causal relationships between metrics; (3) Constructing the causal graph and refining with human-in-the-loop feedback.
  • Figure 2: Causal graph of a model serving task (some nodes and edges omitted for readability).
  • Figure 3: Simulated Model Serving Diagram.
  • Figure 4: Case study on localizing a fault in ModelServing.