Multi-modal Causal Structure Learning and Root Cause Analysis

Lecheng Zheng, Zhengzhang Chen, Jingrui He, Haifeng Chen

TL;DR

MULAN addresses root cause analysis (RCA) in complex, multi-modal systems by learning a unified causal graph from both metrics and logs. It combines a log-tailored language model that converts log sequences into time-series representations, a contrastive learning framework that extracts modality-invariant and modality-specific information, and a KPI-aware attention mechanism that assesses the reliability of each modality when fusing the modality-specific graphs. It then simulates fault propagation over the fused graph using random walk with restart to rank root causes. The approach is reported to achieve state-of-the-art performance on three real-world datasets, to remain robust when one modality is of low quality, and to benefit from exploiting cross-modal correlations. The work offers a practical framework for RCA in noisy, heterogeneous settings and points to online streaming extensions as a future direction.

Abstract

Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses, and ensuring the smooth operation and management of complex systems. Previous data-driven RCA methods, particularly those employing causal discovery techniques, have primarily focused on constructing dependency or causal graphs for backtracking the root causes. However, these methods often fall short as they rely solely on data from a single modality, thereby resulting in suboptimal solutions. In this work, we propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. To explore intricate relationships across different modalities, we propose a contrastive learning-based approach to extract modality-invariant and modality-specific representations within a shared latent space. Additionally, we introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph. Finally, we employ random walk with restart to simulate system fault propagation and identify potential root causes. Extensive experiments on three real-world datasets validate the effectiveness of our proposed framework.
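
The final step named in the abstract, random walk with restart over a learned causal graph, can be illustrated with a minimal generic sketch. This is not the authors' implementation: the transition construction, the self-loop for nodes without outgoing edges, the restart probability of 0.3, and the three-node toy graph are all assumptions made for illustration.

```python
import numpy as np

def random_walk_with_restart(walk_adj, restart, restart_prob=0.3,
                             tol=1e-10, max_iter=1000):
    """Generic random walk with restart (illustrative, not the paper's code).

    walk_adj[i, j] is the weight of a walker step from node i to node j.
    restart is a non-negative vector marking where the walker restarts.
    """
    adj = np.array(walk_adj, dtype=float)        # copy so the input is untouched
    # Assumption: a node with no outgoing edge gets a self-loop.
    dangling = np.flatnonzero(adj.sum(axis=1) == 0)
    adj[dangling, dangling] = 1.0
    trans = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic transitions

    r = np.array(restart, dtype=float)
    r = r / r.sum()
    p = r.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart_prob) * trans.T @ p + restart_prob * r
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p

# Toy causal chain (hypothetical): service B -> service A -> KPI.
# The walker starts at the KPI (node 0) and steps against the causal
# direction, so walker edges are KPI -> A (node 1) and A -> B (node 2).
walk_adj = [[0, 1, 0],
            [0, 0, 1],
            [0, 0, 0]]
scores = random_walk_with_restart(walk_adj, restart=[1, 0, 0])
# Rank candidate entities (everything except the KPI node) by visit probability.
ranking = [int(i) for i in np.argsort(-scores) if i != 0]
```

In this toy chain the walker accumulates the most probability mass at node 2 (about 0.49, versus about 0.21 at node 1), so the upstream service is ranked as the most likely root cause.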

Paper Structure (19 sections, 18 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The overview of the proposed framework MULAN. It consists of four main modules: representation extraction via log-tailored language model, contrastive multi-modal causal structure learning, causal graph fusion with KPI-aware attention, and network propagation-based root cause localization.
  • Figure 2: The overview of log representation extraction. A log parser first extracts the log templates. The inputs to the language model are log sequences in which each unique log template is followed by its frequency within a fixed time window. The label information (i.e., scores) is obtained through anomaly detection methods to guide the log sequence representation learning. [CLS] is a special token used for downstream tasks.
  • Figure 3: Case study on the Product Review dataset. (a): MRR scores of all methods evaluated with a single system metric only. (b): MRR scores of all methods evaluated with one system metric and the system log. (c): Modality weights measured by the KPI-aware mechanism of MULAN in four system fault cases, where $M^+$, $L(M^+)$, $M^-$, and $L(M^-)$ denote the weight of the high-quality metric, the weight of the system log paired with the high-quality metric, the weight of the low-quality metric, and the weight of the system log paired with the low-quality metric, respectively.
  • Figure 4: Parameter analysis on the Product Review dataset w.r.t. MRR. The red dashed line denotes the value used in Table \ref{table_result_1}.
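
The log-sequence input described in the Figure 2 caption (unique templates followed by their frequencies within a fixed time window, prefixed by [CLS]) can be sketched as follows. This is a hedged illustration only: the function name `build_log_sequences`, the 60-second window, the " ; " separator, and the sample templates are assumptions, and the events are taken to be already parsed into templates by a log parser.

```python
from collections import Counter, defaultdict

def build_log_sequences(events, window_seconds=60):
    """Group parsed log events into fixed windows (illustrative sketch).

    events: iterable of (timestamp_in_seconds, template_string) pairs.
    Returns {window_index: sequence_string}, where each sequence lists every
    unique template in the window followed by its frequency.
    """
    windows = defaultdict(Counter)
    for timestamp, template in events:
        windows[int(timestamp // window_seconds)][template] += 1

    sequences = {}
    for index in sorted(windows):
        # Counter keeps first-seen order, so templates appear as encountered.
        parts = [f"{template} {count}"
                 for template, count in windows[index].items()]
        sequences[index] = "[CLS] " + " ; ".join(parts)
    return sequences

# Hypothetical parsed events: (timestamp, template).
events = [
    (3, "Connection to <*> failed"),
    (10, "Connection to <*> failed"),
    (42, "Retrying request <*>"),
    (75, "Connection to <*> failed"),
]
sequences = build_log_sequences(events, window_seconds=60)
```

Each resulting string would then be tokenized and passed to the language model, with the anomaly scores mentioned in the caption serving as training labels; that training step is outside this sketch.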