Agentic Troubleshooting Guide Automation for Incident Management

Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang

TL;DR

This paper tackles the automation of incident management by addressing the unstructured and data-intensive nature of troubleshooting guides (TSGs). It introduces StepFly, a three-stage framework that (i) improves TSG quality with the TSG Mentor tool, (ii) preprocesses TSGs offline to extract execution DAGs and Query Preparation Plugins (QPPs), and (iii) executes TSGs online with a DAG-guided, memory-enabled scheduler-executor and plugins. An empirical study of 92 real TSGs and an evaluation on 80 incidents show that StepFly achieves about 94% success on GPT-4.1 and up to 84% on a mid-range model, and that parallelization reduces execution time by 32.9%–70.4% for parallelizable TSGs. The work shows that combining structured preprocessing with DAG-guided, memory-aware execution improves the reliability and efficiency of automated incident resolution, and it discusses practicality and extensibility for real-world deployment.

Abstract

Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
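To make the third stage more concrete, the following is a minimal sketch of DAG-guided scheduling with conditional edges and parallel execution of independent steps. It is not StepFly's implementation: the function `run_dag`, the edge format `(src, dst, label)`, the `"Y"`/`"N"` outcome convention, and the plain dictionary standing in for the memory system are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

SKIPPED = object()  # sentinel: the step was never executed


def run_dag(steps, edges, max_workers=4):
    """Run steps in DAG order, executing independent steps in parallel.

    steps: dict mapping step name -> callable(memory) returning an outcome
           label such as "Y"/"N", or None for steps with no branching.
    edges: list of (src, dst, label); label None means unconditional,
           otherwise the edge is taken only if src's outcome == label.
    """
    pending = {name: 0 for name in steps}  # unresolved incoming edges
    taken = {name: 0 for name in steps}    # incoming edges actually taken
    for _, dst, _ in edges:
        pending[dst] += 1

    memory = {}  # hypothetical stand-in for a shared memory system
    order = []   # completion order of executed steps

    def resolve(src, outcome, ready):
        # Mark every outgoing edge of src as taken or skipped.
        for s, dst, label in edges:
            if s != src:
                continue
            pending[dst] -= 1
            if outcome is not SKIPPED and (label is None or label == outcome):
                taken[dst] += 1
            if pending[dst] == 0:
                if taken[dst] > 0:
                    ready.append(dst)
                else:
                    # No incoming edge was taken: skip dst and propagate.
                    resolve(dst, SKIPPED, ready)

    ready = [name for name, count in pending.items() if count == 0]
    running = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while ready or running:
            for name in ready:
                running[pool.submit(steps[name], memory)] = name
            ready = []
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                name = running.pop(fut)
                outcome = fut.result()
                memory[name] = outcome
                order.append(name)
                resolve(name, outcome, ready)
    return order, memory


# Hypothetical four-step guide: a check that branches, two independent
# follow-up steps on the "Y" branch, and a final summary step.
steps = {
    "check": lambda mem: "Y",
    "a": lambda mem: None,
    "b": lambda mem: None,
    "c": lambda mem: None,
    "done": lambda mem: None,
}
edges = [
    ("check", "a", "Y"),
    ("check", "b", "Y"),
    ("check", "c", "N"),
    ("a", "done", None),
    ("b", "done", None),
    ("c", "done", None),
]
order, memory = run_dag(steps, edges)
```

In this sketch, steps `a` and `b` become ready at the same time and are submitted to the thread pool together, while `c` is skipped because its only incoming edge carries the label `"N"`. A step with several predecessors, such as `done`, waits until every incoming edge has been resolved as taken or skipped.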

Paper Structure

This paper contains 40 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Workflow of a real TSG for diagnosing availability incidents of a large online service.
  • Figure 2: Statistics on TSG characteristics: (a) token count distribution, (b) step count distribution, (c) tool usage, (d) query template percentage.
  • Figure 3: TSG Issue Distribution.
  • Figure 4: The Proposed Approach
  • Figure 5: The execution DAG of the example TSG shown in Fig. 1. Conditional edges are labeled with their key questions and the answers "Y" or "N"; all other edges are unconditional.
  • ...and 5 more figures