LogSage: An LLM-Based Framework for CI/CD Failure Detection and Remediation with Industrial Validation
Weiyuan Xu, Juntao Luo, Tao Huang, Kaixin Sui, Jie Geng, Qijun Ma, Isami Akasaka, Xiaoxue Shi, Jing Tang, Peng Cai
TL;DR
LogSage tackles noisy CI/CD logs with an end-to-end, LLM-based framework for root cause analysis (RCA) and automated remediation. It pairs a token-efficient offline preprocessing pipeline with a two-stage online process: RCA first, then multi-route retrieval and LLM tool-calling to generate executable fixes. On a curated dataset of 367 failures it achieves over 98% RCA precision, and in a year-long ByteDance deployment it processed over 1 million executions with end-to-end precision above 80% and substantial adoption. The work delivers a reproducible dataset, a robust RCA pipeline, and a practical remediation framework suited to real-world DevOps workflows.
Abstract
Continuous Integration and Deployment (CI/CD) pipelines are critical to modern software engineering, yet diagnosing and resolving their failures remains complex and labor-intensive. We present LogSage, the first end-to-end LLM-powered framework for root cause analysis (RCA) and automated remediation of CI/CD failures. LogSage employs a token-efficient log preprocessing pipeline to filter noise and extract critical errors, then performs structured diagnostic prompting for accurate RCA. For solution generation, it leverages retrieval-augmented generation (RAG) to reuse historical fixes and invokes automated fixes via LLM tool-calling. On a newly curated benchmark of 367 GitHub CI/CD failures, LogSage achieves over 98% precision, near-perfect recall, and an F1 improvement of more than 38 percentage points in the RCA stage compared with recent LLM-based baselines. In a year-long industrial deployment at ByteDance, it processed over 1.07M executions, with end-to-end precision exceeding 80%. These results demonstrate that LogSage provides a scalable and practical solution for automating CI/CD failure management in real-world DevOps workflows.
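To make the idea of token-efficient log preprocessing concrete, the following is a minimal sketch, not the LogSage implementation. It assumes a simple keyword match for error lines, a fixed context window around each match, and a line-count budget standing in for a token budget. The pattern `ERROR_PATTERN`, the function `extract_error_context`, and its parameters are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical keyword pattern; the actual filtering rules used by LogSage
# are not specified in this abstract.
ERROR_PATTERN = re.compile(
    r"error|fail(?:ed|ure)?|exception|fatal|traceback", re.IGNORECASE
)


def extract_error_context(log_text, context=2, max_lines=50):
    """Return error-like lines plus `context` neighbouring lines on each side.

    `max_lines` is a crude stand-in for a token budget: if more lines are
    selected, only the last `max_lines` of them are kept (an assumption that
    the most relevant failure output tends to appear late in the log).
    """
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    selected = sorted(keep)
    # Trim from the front so that at most max_lines indices remain.
    selected = selected[max(0, len(selected) - max_lines):]
    return [lines[i] for i in selected]


sample_log = "\n".join([
    "step 1 ok",
    "step 2 ok",
    "compiling",
    "src/main.c:10: error: missing ';'",
    "make: *** [all] Error 1",
    "cleanup",
    "done",
])

filtered = extract_error_context(sample_log, context=1)
print("\n".join(filtered))
```

In a pipeline of the kind the abstract describes, the filtered lines rather than the full log would be passed to the diagnostic prompt, which is what keeps the token count low.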
