TrioXpert: An Automated Incident Management Framework for Microservice System

Yongqian Sun, Yu Luo, Xidao Wen, Yuan Yuan, Xiaohui Nie, Shenglin Zhang, Tong Liu, Xi Luo

TL;DR

TrioXpert addresses the challenge of automated incident management in large-scale microservice systems by leveraging multimodal data (metrics, logs, traces) through three modality-specific preprocessing pipelines, a multi-dimensional system status representation, and a collaborative LLM reasoning framework with three specialized experts. The approach enables simultaneous anomaly detection, failure triage, and root-cause localization with interpretable reasoning evidence, mitigating issues of semantic loss, data overload, and LLM hallucinations via structured prompts and a coordination pipeline. Empirical results on two real-world datasets show substantial improvements across AD, FT, and RCL tasks, with average per-case diagnosis times under 15 seconds and up to 163.1% relative gains in RCL performance; Lenovo deployment confirms practical gains in diagnostic efficiency and accuracy. The work demonstrates that dedicated modality-processing, robust representation, and multi-expert LLM collaboration can deliver scalable, interpretable, and production-ready incident management for complex microservice ecosystems.

Abstract

Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, or traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to simultaneously handle multiple tasks while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks. TrioXpert has also been deployed in Lenovo's production environment, demonstrating substantial gains in diagnostic efficiency and accuracy.

Paper Structure

This paper contains 44 sections, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Multimodal data and incident lifecycle. (Top) Examples of the three data modalities used in incident management – metrics (time-series signals), logs (timestamped events), and a trace (service call graph). (Bottom) The four-stage incident management lifecycle: (1) Anomaly Detection, (2) Failure Triage, (3) Root Cause Localization, and (4) Incident Mitigation.
  • Figure 2: The overview of TrioXpert. The framework consists of three modules: (a) Multimodal Data Preprocessing, which filters and prepares metrics, traces, and logs (e.g., aligning by time and extracting relevant subsets); (b) Multi-Dimensional System Status Representation, which derives a numerical feature-based preliminary analysis and textual summaries of logs/traces (via specialized “Log Summarizer” and “Trace Summarizer”); and (c) LLMs Collaborative Reasoning, a multi-expert collaboration architecture comprising a Numerical Expert and a Textual Expert that collaborate with an Incident Expert LLM to perform anomaly detection, failure triage, and root cause localization simultaneously.
  • Figure 3: The prompt of Log Summarizer. This prompt utilizes the "RGCIE" principle (i.e., Role, Goal, Constraints, Instructions, Example) to define the Log Summarizer, which helps mitigate the risk of hallucination.
  • Figure 4: The prompt of Incident Expert. This structured prompt is grounded in the "RGCIE" principle to mitigate hallucination, providing a clear definition of the Incident Expert itself and its corresponding tasks (i.e., AD, FT, and RCL). The prompt also contains a conflict resolution and aggregation policy to handle potential inconsistencies.
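
The "RGCIE" principle named in Figures 3 and 4 (Role, Goal, Constraints, Instructions, Example) can be illustrated with a minimal sketch of how such a prompt might be assembled. This is not the authors' prompt: the `RGCIEPrompt` class, the `render` method, the section headings, and every field value below are hypothetical placeholders chosen only to show the five-part ordering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RGCIEPrompt:
    """Hypothetical container for the five RGCIE parts (not from the paper)."""
    role: str
    goal: str
    constraints: List[str]
    instructions: List[str]
    example: str

    def render(self) -> str:
        # Emit the parts in the fixed order Role, Goal, Constraints,
        # Instructions, Example so every prompt has the same layout.
        sections = [
            ("Role", self.role),
            ("Goal", self.goal),
            ("Constraints", "\n".join(f"- {c}" for c in self.constraints)),
            ("Instructions", "\n".join(
                f"{i}. {step}" for i, step in enumerate(self.instructions, 1)
            )),
            ("Example", self.example),
        ]
        return "\n\n".join(f"## {name}\n{body}" for name, body in sections)


# Placeholder content for illustration only; the real wording is in Figure 3.
prompt = RGCIEPrompt(
    role="You are a log summarizer for a microservice system.",
    goal="Summarize the provided log lines for downstream diagnosis.",
    constraints=[
        "Use only information present in the logs.",
        "Do not speculate about causes that the logs do not mention.",
    ],
    instructions=[
        "Group log lines by service.",
        "Report error patterns for each service.",
    ],
    example="Input: <log lines>\nOutput: <summary>",
)
text = prompt.render()
print(text)
```

Keeping the layout fixed and stating constraints explicitly is one plausible way to apply the principle; the hallucination-mitigation effect itself is the paper's claim and is not demonstrated by this sketch.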