Multi-modal Causal Structure Learning and Root Cause Analysis
Lecheng Zheng, Zhengzhang Chen, Jingrui He, Haifeng Chen
TL;DR
Mulan addresses root cause analysis (RCA) in complex, multi-modal systems by learning a unified causal graph from both metrics and logs. It combines three components: a log-tailored language model that converts log sequences into time-series representations, a contrastive learning framework that extracts modality-invariant and modality-specific information, and a KPI-aware attention mechanism that weighs modality reliability when fusing the modality-specific graphs. Faults are then propagated over the fused graph with a random walk with restart to rank candidate root causes. The approach is reported to achieve state-of-the-art performance on three real-world datasets, to remain robust when one modality is of low quality, and to benefit from exploiting cross-modal correlations. The work offers a practical framework for RCA in noisy, heterogeneous settings and points to online streaming extensions as a future direction.
Abstract
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses, and ensuring the smooth operation and management of complex systems. Previous data-driven RCA methods, particularly those employing causal discovery techniques, have primarily focused on constructing dependency or causal graphs for backtracking the root causes. However, these methods often fall short as they rely solely on data from a single modality, thereby resulting in suboptimal solutions. In this work, we propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. To explore intricate relationships across different modalities, we propose a contrastive learning-based approach to extract modality-invariant and modality-specific representations within a shared latent space. Additionally, we introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph. Finally, we employ random walk with restart to simulate system fault propagation and identify potential root causes. Extensive experiments on three real-world datasets validate the effectiveness of our proposed framework.
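To make the final step concrete, the following is a minimal sketch of random walk with restart on a toy causal graph. It is a generic textbook formulation, not the paper's exact transition scheme. The function name, the four-node graph, the edge weights, and the restart probability of 0.3 are all assumptions made for illustration. The walker starts at the KPI node, moves backward along causal edges from effect to cause, and returns to the KPI with a fixed probability at every step; the stationary visit probabilities are then used as root cause scores.

```python
def random_walk_with_restart(causal, kpi_index, restart_prob=0.3,
                             tol=1e-12, max_iter=10_000):
    """Score nodes by a random walk that starts and restarts at the KPI node.

    causal[i][j] is the (hypothetical) strength of the edge "i causes j".
    The walker moves backward, from an effect j to one of its causes i,
    with probability proportional to causal[i][j].
    """
    n = len(causal)

    # Build the backward transition matrix: trans[j][i] = P(move j -> i).
    trans = []
    for j in range(n):
        weights = [causal[i][j] for i in range(n)]
        total = sum(weights)
        if total > 0:
            trans.append([w / total for w in weights])
        else:
            # A node with no known causes keeps the walker in place.
            row = [0.0] * n
            row[j] = 1.0
            trans.append(row)

    # Start with all probability mass on the KPI node.
    scores = [0.0] * n
    scores[kpi_index] = 1.0

    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walked = sum(scores[j] * trans[j][i] for j in range(n))
            restart = restart_prob if i == kpi_index else 0.0
            nxt.append((1.0 - restart_prob) * walked + restart)
        delta = sum(abs(a - b) for a, b in zip(nxt, scores))
        scores = nxt
        if delta < tol:
            break
    return scores


# Toy graph (assumed): 0 = database, 1 = API, 2 = frontend, 3 = KPI.
causal = [
    [0.0, 0.9, 0.0, 0.0],  # database -> API
    [0.0, 0.0, 0.8, 0.3],  # API -> frontend, API -> KPI
    [0.0, 0.0, 0.0, 0.7],  # frontend -> KPI
    [0.0, 0.0, 0.0, 0.0],  # the KPI causes nothing
]
kpi = 3
scores = random_walk_with_restart(causal, kpi_index=kpi)

# Rank candidate root causes, excluding the KPI node itself.
ranking = sorted((i for i in range(len(causal)) if i != kpi),
                 key=lambda i: scores[i], reverse=True)
```

In this toy graph the database node ranks first because every backward path from the KPI eventually reaches it. In a full pipeline the matrix `causal` would be the fused causal graph learned from metrics and logs rather than hand-written weights.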
