LogRCA: Log-based Root Cause Analysis for Distributed Services

Thorsten Wittkopp, Philipp Wiesner, Odej Kao

TL;DR

LogRCA tackles the challenge of identifying a minimal, contextually relevant set of log lines that explain the root cause of failures in distributed services. It frames root cause analysis as a semi-supervised positive-unlabeled problem and leverages a transformer encoder with a custom loss to score log lines within an investigation window, while balancing training data to improve rare-case performance. On a large production Android log dataset (44.3 million lines, 398 failures, 80 labeled windows), LogRCA consistently outperforms baselines in recall, with 93.5% recall at top-10 candidates and substantial coverage at higher candidate counts. The approach shows practical impact for AIOps by enabling operators to quickly pinpoint root-cause signals among vast noisy logs, with data-balancing proving especially beneficial for rare or unseen failures.
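The recall at top-10 figure above can be read as follows: each log line in an investigation window receives a score, the highest-scoring lines become the root cause candidates, and recall measures how many expert-labeled root cause lines appear among them. The block below is a minimal sketch of that metric, not the authors' implementation. The function names, the example scores, and the tie-breaking rule are assumptions made for illustration, and the paper may aggregate recall across windows differently.

```python
from typing import Sequence


def top_k_candidates(scores: Sequence[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring log lines.

    Ties are broken by position in the window (earlier line first);
    this rule is an assumption for the sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def recall_at_k(scores: Sequence[float], root_cause_idx: set[int], k: int) -> float:
    """Fraction of labeled root cause lines that appear among the top-k candidates."""
    if not root_cause_idx:
        raise ValueError("window has no labeled root cause lines")
    hits = root_cause_idx.intersection(top_k_candidates(scores, k))
    return len(hits) / len(root_cause_idx)


# Hypothetical scores for a six-line investigation window (not real data).
window_scores = [0.05, 0.91, 0.10, 0.72, 0.03, 0.40]
# Hypothetical expert labels: lines 1 and 5 describe the root cause.
labeled_root_causes = {1, 5}

print(top_k_candidates(window_scores, 2))
print(recall_at_k(window_scores, labeled_root_causes, 2))
print(recall_at_k(window_scores, labeled_root_causes, 3))
```

In this toy window, raising the number of candidates from 2 to 3 brings the second labeled line into the candidate set. This is the trade-off behind reporting recall at several candidate counts: a larger candidate set covers more root cause lines but leaves the operator more lines to read.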

Abstract

To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure. We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause. LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data. We evaluated our approach on a large-scale production log data set of 44.3 million log lines, which contains 80 failures, whose root causes were labeled by experts. LogRCA consistently outperforms baselines based on deep learning and statistical analysis in terms of precision and recall to detect candidate root causes. In addition, we investigated the impact of our deployed data balancing approach, demonstrating that it considerably improves performance on rare failures.

Paper Structure

This paper contains 23 sections, 4 equations, and 7 figures.

Figures (7)

  • Figure 1: LogRCA helps users identify a minimal set of root cause log lines that reside within an investigation time window prior to the failure.
  • Figure 2: Illustrating the training process with incorrectly labeled data and the result for $n=3$. Log lines in orange have been assigned to the unknown class $\mathcal{U}$, while black log lines are assigned to the normal class $\mathcal{P}$.
  • Figure 3: Balancing the training data.
  • Figure 4: Steps for selecting root cause candidates.
  • Figure 5: Investigation time window sizes and their fraction of root cause log lines.
  • ...and 2 more figures