TrioXpert: An Automated Incident Management Framework for Microservice System
Yongqian Sun, Yu Luo, Xidao Wen, Yuan Yuan, Xiaohui Nie, Shenglin Zhang, Tong Liu, Xi Luo
TL;DR
TrioXpert addresses automated incident management in large-scale microservice systems by combining multimodal data (metrics, logs, traces) through three modality-specific preprocessing pipelines, a multi-dimensional system status representation, and a collaborative LLM reasoning framework with three specialized experts. The approach performs anomaly detection (AD), failure triage (FT), and root cause localization (RCL) simultaneously and supplies interpretable reasoning evidence, while structured prompts and a coordination pipeline mitigate semantic loss, data overload, and LLM hallucinations. Results on two real-world datasets show substantial improvements on all three tasks, with average per-case diagnosis times under 15 seconds and relative gains of up to 163.1% in RCL; a deployment at Lenovo confirms practical gains in diagnostic efficiency and accuracy. The work demonstrates that dedicated modality processing, robust representation, and multi-expert LLM collaboration can deliver scalable, interpretable, and production-ready incident management for complex microservice ecosystems.
Abstract
Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, or traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert uses three independent data processing pipelines, each designed around the inherent characteristics of one modality, to characterize the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to handle multiple tasks simultaneously while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks. TrioXpert has also been deployed in Lenovo's production environment, demonstrating substantial gains in diagnostic efficiency and accuracy.
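
To make the data flow described in the abstract concrete, the following is a minimal, purely illustrative sketch: three modality-specific pipelines feed a combined numerical and textual status, which is then used to answer the AD, FT, and RCL questions with supporting evidence. It is not TrioXpert's implementation. Every name here (SystemStatus, process_metrics, process_logs, process_traces, diagnose), the z-score rule, the thresholds, and the failure-type labels are assumptions made for illustration, and a simple rule-based function stands in for the LLM experts.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class SystemStatus:
    """Hypothetical container for the numerical and textual views of the system."""
    numerical: dict = field(default_factory=dict)  # service -> anomaly score
    textual: list = field(default_factory=list)    # condensed log/trace evidence


def process_metrics(metrics):
    """Toy metric pipeline: z-score of each service's latest value vs. its history."""
    scores = {}
    for service, values in metrics.items():
        history, latest = values[:-1], values[-1]
        spread = pstdev(history) or 1.0  # fall back to 1.0 on a flat series
        scores[service] = abs(latest - mean(history)) / spread
    return scores


def process_logs(logs):
    """Toy log pipeline: keep only error-level lines as textual evidence."""
    return [f"log[{svc}]: {line}" for svc, line in logs if "ERROR" in line]


def process_traces(traces, slow_ms=1000):
    """Toy trace pipeline: keep only calls slower than slow_ms (assumed cutoff)."""
    return [f"trace: {src}->{dst} took {ms} ms"
            for src, dst, ms in traces if ms >= slow_ms]


def build_status(metrics, logs, traces):
    """Combine the three pipeline outputs into one status object."""
    return SystemStatus(
        numerical=process_metrics(metrics),
        textual=process_logs(logs) + process_traces(traces),
    )


def diagnose(status, threshold=3.0):
    """Rule-based stand-in for the LLM experts (assumption, not the paper's logic)."""
    suspects = {s: z for s, z in status.numerical.items() if z >= threshold}
    text = " ".join(status.textual).lower()
    if "timeout" in text:
        failure_type = "network"    # hypothetical label
    elif suspects:
        failure_type = "resource"   # hypothetical label
    else:
        failure_type = "unknown"
    return {
        "anomaly": bool(suspects) or bool(status.textual),              # AD
        "failure_type": failure_type,                                   # FT
        "root_cause": max(suspects, key=suspects.get) if suspects else None,  # RCL
        "evidence": list(status.textual),                               # reasoning evidence
    }
```

A call such as diagnose(build_status(metrics, logs, traces)) returns all three task outputs together with the evidence lines that led to them, which mirrors, at toy scale, the idea of handling AD, FT, and RCL in one pass while keeping the reasoning inspectable.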
