Root Cause Analysis Method Based on Large Language Models with Residual Connection Structures
Liming Zhou, Ailing Liu, Hongwei Liu, Min He, Heng Zhang
TL;DR
The paper tackles root cause analysis in large-scale microservice systems by reframing RCA as deep temporal causal reasoning and introducing RC-LLM, a residual-connection-based framework that fuses multi-source telemetry (traces, metrics, logs) for robust evidence construction. The method integrates data via a residual fusion mechanism and constrains LLM outputs to structured, interpretable results, achieving up to $48.25\%$ accuracy on a challenging CCF-AIOps dataset while improving reasoning efficiency. Empirical results across four iterative stages show progressive gains, with the final configuration outperforming naïve baselines and providing coherent reasoning traces for diagnosis. Case studies illustrate interpretable, service-level root-cause localization even when trace signals are incomplete, highlighting the practical impact of combining structured multi-source data with LLM-based reasoning in dynamic microservice environments.
Abstract
Root cause localization remains challenging in complex, large-scale microservice architectures. Intricate fault propagation among microservices and the high dimensionality of telemetry data, including metrics, logs, and traces, limit the effectiveness of existing root cause analysis (RCA) methods. This paper proposes RC-LLM, a residual-connection-based RCA method built on a large language model (LLM). A residual-like hierarchical fusion structure integrates multi-source telemetry data, while the contextual reasoning capability of the LLM is leveraged to model temporal and cross-microservice causal dependencies. Experimental results on CCF-AIOps microservice datasets demonstrate that RC-LLM achieves strong accuracy and efficiency in root cause analysis.
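
As a rough illustration of the residual-like fusion idea, the sketch below accumulates evidence from each telemetry source while passing earlier evidence through unchanged, then renders a prompt that constrains the model to a structured answer. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Evidence` class, the `fuse`, `build_prompt`, and `parse_answer` helpers, the JSON field names, and the service names are all hypothetical, and no actual LLM call is made.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Accumulated evidence for one incident (hypothetical structure)."""
    sections: list = field(default_factory=list)

    def fuse(self, source: str, summary: str) -> "Evidence":
        # Residual-style update: output = previous evidence + new contribution.
        # Earlier sections pass through unchanged, like a skip connection.
        return Evidence(self.sections + [(source, summary)])


def build_prompt(evidence: Evidence, candidates: list) -> str:
    """Render the fused evidence plus a structured-output constraint."""
    body = "\n".join(f"[{src}] {text}" for src, text in evidence.sections)
    # Assumed output schema; the paper's actual format is not given here.
    schema = {"root_cause_service": "<one of candidates>", "reason": "<short text>"}
    return (
        f"Evidence:\n{body}\n\n"
        f"Candidates: {', '.join(candidates)}\n"
        f"Answer only with JSON matching: {json.dumps(schema)}"
    )


def parse_answer(raw: str, candidates: list):
    """Accept a reply only if it is a JSON object naming a known candidate."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("root_cause_service") not in candidates:
        return None
    return data


# Illustrative usage with made-up telemetry summaries.
evidence = (
    Evidence()
    .fuse("traces", "checkout -> payment calls show a latency spike")
    .fuse("metrics", "payment CPU utilisation is saturated")
    .fuse("logs", "payment reports connection pool exhaustion")
)
prompt = build_prompt(evidence, ["checkout", "payment"])
```

Because each `fuse` call returns a new object that keeps every earlier section, a missing or empty source (for example, incomplete traces) only removes its own contribution instead of discarding the evidence gathered so far.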
