Root Cause Analysis Method Based on Large Language Models with Residual Connection Structures

Liming Zhou; Ailing Liu; Hongwei Liu; Min He; Heng Zhang

TL;DR

The paper tackles root cause analysis in large-scale microservice systems by reframing RCA as deep temporal causal reasoning and introducing RC-LLM, a residual-connection-based framework that fuses multi-source telemetry (traces, metrics, logs) for robust evidence construction. The method integrates data via a residual fusion mechanism and constrains LLM outputs to structured, interpretable results, achieving up to $48.25\%$ accuracy on a challenging CCF-AIOps dataset while improving reasoning efficiency. Empirical results across four iterative stages show progressive gains, with the final configuration outperforming naïve baselines and providing coherent reasoning traces for diagnosis. Case studies illustrate interpretable, service-level root-cause localization even when trace signals are incomplete, highlighting the practical impact of combining structured multi-source data with LLM-based reasoning in dynamic microservice environments.

Abstract

Root cause localization remains challenging in complex, large-scale microservice architectures. Intricate fault propagation among microservices and the high dimensionality of telemetry data, including metrics, logs, and traces, limit the effectiveness of existing root cause analysis (RCA) methods. This paper proposes RC-LLM, a residual-connection-based RCA method that uses a large language model (LLM). A residual-like hierarchical fusion structure integrates multi-source telemetry data, while the contextual reasoning capability of the LLM is leveraged to model temporal and cross-microservice causal dependencies. Experimental results on CCF-AIOps microservice datasets demonstrate that RC-LLM achieves strong accuracy and efficiency in root cause analysis.
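
The residual-like fusion idea can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Evidence`, `fuse`, and `build_evidence`, the trace/metric/log ordering, and the sample findings are all assumptions made for illustration. The sketch only shows the general pattern in which each stage carries the earlier evidence forward unchanged (the skip path) and adds its own findings on top, so a stage with missing data does not erase what was already collected.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Accumulated diagnostic evidence (hypothetical container, not from the paper)."""
    notes: list[str] = field(default_factory=list)


def fuse(prev: Evidence, source: str, findings: list[str]) -> Evidence:
    """Residual-style step: output = previous evidence (skip path) + new findings."""
    new_notes = [f"[{source}] {item}" for item in findings]
    return Evidence(notes=prev.notes + new_notes)


def build_evidence(traces: list[str], metrics: list[str], logs: list[str]) -> Evidence:
    """Fuse the three telemetry sources in an assumed order: trace, metric, log."""
    evidence = Evidence()
    for source, findings in (("trace", traces), ("metric", metrics), ("log", logs)):
        evidence = fuse(evidence, source, findings)
    return evidence


if __name__ == "__main__":
    # Hypothetical findings; the trace stage is empty to mimic incomplete traces.
    result = build_evidence(
        traces=[],
        metrics=["CPU usage spike on service-a"],
        logs=["timeout errors in service-a"],
    )
    for note in result.notes:
        print(note)
```

In this sketch an empty trace stage leaves the evidence untouched, and the metric and log findings still reach the final evidence list that would be passed to the LLM for reasoning.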

Paper Structure (24 sections, 9 equations, 7 figures, 4 tables)

This paper contains 24 sections, 9 equations, 7 figures, 4 tables.

Introduction
Related Work
Rule-Based and Expert-Driven RCA Methods
Machine Learning-Based RCA Methods
LLM-Based RCA Methods
Method
Data Input
Data Preprocessing
Data Analysis
Trace Analysis
Metric Analysis
Log Analysis
Data Integration
LLM Reasoning
Experiment
...and 9 more sections

Figures (7)

Figure 1: Root Cause Analysis for Microservice-based Systems.
Figure 2: Architecture of RC-LLM.
Figure 3: Sudden Numerical Anomaly of CPU Usage.
Figure 4: Sudden Numerical Anomaly of RRT.
Figure 5: Request-Response Mismatch Anomaly.
...and 2 more figures
