LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems
Zhihan Jiang, Jinyang Liu, Yichen Li, Haiyu Huang, Xiao He, Tieying Zhang, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu
TL;DR
LogPilot tackles the disconnect between industrial alerting and log-based root cause analysis (RCA) with an intent-aware, scalable framework powered by LLMs. It combines a log scoping module driven by alert semantics, a request-centric log chain processor, and a clustering-based diagnosis pipeline that feeds compact, representative samples to LLMs for RCA, followed by a synthesis step. Evaluated on production data from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods, with end-to-end diagnosis under one minute and a per-alert cost of $0.074. It has been deployed across multiple services. The work demonstrates practical impact by improving observability, reducing mean time to diagnose, and informing DevOps practices through log-quality feedback and SOP enhancements.
Abstract
Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice because their log scoping is alert-agnostic and they cannot organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. To be intent-aware, LogPilot interprets the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLM for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
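To make the chain-and-cluster idea concrete, the following is a minimal sketch, not LogPilot's actual implementation. It assumes each log record carries a request ID, a timestamp, and a service name; the `LogRecord` type, the digit-masking `template` function, and the exact-signature grouping are all hypothetical simplifications chosen for illustration. The clustering method used by LogPilot itself may differ.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Hypothetical record layout, assumed for this sketch only.
    request_id: str
    timestamp: float
    service: str
    message: str


def template(message: str) -> str:
    # Mask digit runs so variable values do not split otherwise identical chains.
    return re.sub(r"\d+", "<*>", message)


def build_chains(records):
    # Group logs by request and order them in time; the service name of each
    # record supplies the "where", the timestamp the "when".
    chains = defaultdict(list)
    for record in records:
        chains[record.request_id].append(record)
    for chain in chains.values():
        chain.sort(key=lambda r: r.timestamp)
    return dict(chains)


def cluster_chains(chains):
    # Assumption: two chains are "similar" when their ordered sequences of
    # (service, message template) pairs are identical.
    clusters = defaultdict(list)
    for request_id, chain in chains.items():
        signature = tuple((r.service, template(r.message)) for r in chain)
        clusters[signature].append(request_id)
    return dict(clusters)


def representatives(chains, clusters):
    # One sample chain per cluster, largest clusters first, paired with the
    # cluster size so a downstream prompt can state how common each pattern is.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return [(len(ids), chains[ids[0]]) for ids in ordered]


if __name__ == "__main__":
    logs = [
        LogRecord("r1", 1.0, "gateway", "recv order 101"),
        LogRecord("r1", 2.0, "db", "timeout after 3000 ms"),
        LogRecord("r2", 2.5, "db", "timeout after 3100 ms"),
        LogRecord("r2", 2.0, "gateway", "recv order 102"),
        LogRecord("r3", 3.0, "gateway", "recv order 103"),
        LogRecord("r3", 3.1, "db", "ok in 12 ms"),
    ]
    chains = build_chains(logs)
    for size, chain in representatives(chains, cluster_chains(chains)):
        print(size, [(r.service, template(r.message)) for r in chain])
```

In this sketch only one chain per recurring pattern would be passed on for diagnosis, so the input size grows with the number of distinct execution patterns rather than with the number of requests.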
