LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis

Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, Haifeng Chen

TL;DR

LEMMA-RCA addresses the lack of large-scale, open benchmarks for root cause analysis by introducing a multi-domain, multi-modal RCA dataset spanning IT and OT fault data with ground-truth root-cause labels. It combines rich time-series metrics and unstructured logs across microservice and cyber-physical system settings (e.g., SWaT and WADI), enabling comprehensive benchmarking. An extensive study evaluates eight baseline RCA methods under single- and multi-modal, offline and online configurations, demonstrating that data fusion yields substantial gains while highlighting dataset-dependent challenges such as short-lived faults. The dataset is publicly released to spur reproducible research and advance real-time RCA in mission-critical environments, with implications for future work on multi-modal anomaly detection and LLM-assisted reasoning.

Abstract

Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Visualization of the microservice system platform, which contains 6 nodes and multiple pods that may vary across different stages; and the ElasticSearch log data.
  • Figure 2: Visualization of KPI for system failure cases. Left: the first two sub-figures are from the Product Review sub-dataset; the third and fourth sub-figures are from the Cloud Computing sub-dataset; Right: the first two sub-figures are from the SWaT sub-dataset; the last two sub-figures are from the WADI sub-dataset.
  • Figure 3: Visualization of two system fault scenarios. Left: Cryptojacking. Right: External storage failure.
  • Figure 4: Visualization of root cause for one system failure case (i.e., External Storage Failure) on the Product Review Platform. Left: six system metrics of root cause. Right: the system log of the root cause pod (i.e., Mongodb-v1) with the x-axis representing the timestamp, the y-axis indicating the log event ID, and the colored dots denoting event occurrences. Sudden drops in the metrics data, as well as new log event patterns observed at the midpoint, indicate a system failure.
  • Figure 5: Corresponding to \ref{fig_aiops_structure} (a). The architecture of the Product Review Platform.
  • ...and 2 more figures