LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis
Lecheng Zheng, Zhengzhang Chen, Dongjie Wang, Chengyuan Deng, Reon Matsuoka, Haifeng Chen
TL;DR
LEMMA-RCA addresses the lack of large-scale, open benchmarks for root cause analysis (RCA) by introducing a multi-domain, multi-modal dataset of IT and OT fault data with ground-truth root-cause labels. It combines time-series metrics and unstructured logs from microservice and cyber-physical system settings (e.g., SWaT and WADI), which supports benchmarking across both domains and modalities. An extensive study evaluates eight baseline RCA methods under single- and multi-modal, offline and online configurations, showing that data fusion yields substantial gains while highlighting dataset-dependent challenges such as short-lived faults. The dataset is publicly released to support reproducible research on real-time RCA in mission-critical environments, with possible future relevance to multi-modal anomaly detection and LLM-assisted reasoning.
Abstract
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.
