LogSage: An LLM-Based Framework for CI/CD Failure Detection and Remediation with Industrial Validation

Weiyuan Xu, Juntao Luo, Tao Huang, Kaixin Sui, Jie Geng, Qijun Ma, Isami Akasaka, Xiaoxue Shi, Jing Tang, Peng Cai

TL;DR

LogSage tackles noisy CI/CD logs with end-to-end, LLM-based root cause analysis (RCA) and automated remediation. It combines a token-efficient offline preprocessing pipeline with a two-stage online process: RCA first, then multi-route retrieval and LLM tool-calling to generate executable fixes. On a curated dataset of 367 failures it achieves over 98% RCA precision, and in a year-long ByteDance deployment it processed over 1.07 million executions with end-to-end precision above 80%. The work delivers a reproducible dataset, a robust RCA pipeline, and a practical remediation framework suited to real-world DevOps workflows.

Abstract

Continuous Integration and Deployment (CI/CD) pipelines are critical to modern software engineering, yet diagnosing and resolving their failures remains complex and labor-intensive. We present LogSage, the first end-to-end LLM-powered framework for root cause analysis (RCA) and automated remediation of CI/CD failures. LogSage employs a token-efficient log preprocessing pipeline to filter noise and extract critical errors, then performs structured diagnostic prompting for accurate RCA. For solution generation, it leverages retrieval-augmented generation (RAG) to reuse historical fixes and invokes automated fixes via LLM tool-calling. On a newly curated benchmark of 367 GitHub CI/CD failures, LogSage achieves over 98% precision, near-perfect recall, and an F1 improvement of more than 38 percentage points in the RCA stage compared with recent LLM-based baselines. In a year-long industrial deployment at ByteDance, it processed over 1.07M executions, with end-to-end precision exceeding 80%. These results demonstrate that LogSage provides a scalable and practical solution for automating CI/CD failure management in real-world DevOps workflows.
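
To make the idea of filtering noise and extracting critical errors more concrete, here is a minimal sketch in Python. It is not LogSage's actual pipeline, whose rules are not given in this abstract. The keyword pattern, the `extract_error_context` helper, and the sample log are all hypothetical, chosen only to illustrate how keeping error lines plus a little surrounding context can shrink the input sent to an LLM.

```python
import re

# Hypothetical keyword pattern; the paper's real filtering rules are not stated here.
ERROR_PATTERN = re.compile(r"error|failed|exception|fatal", re.IGNORECASE)


def extract_error_context(log_lines, context=1):
    """Keep lines matching ERROR_PATTERN plus `context` neighbours on each side."""
    keep = set()
    for i, line in enumerate(log_lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(log_lines), i + context + 1)
            keep.update(range(start, end))
    # Preserve the original ordering of the retained lines.
    return [log_lines[i] for i in sorted(keep)]


# Illustrative log, invented for this example.
sample_log = [
    "Downloading dependencies",
    "Resolving package versions",
    "Compiling module core",
    "ERROR: cannot find symbol Foo",
    "Build step exited with code 1",
    "Uploading artifacts",
]

filtered = extract_error_context(sample_log, context=1)
for line in filtered:
    print(line)
```

With `context=1`, only the error line and its two neighbours survive, so three of the six lines would be passed on for diagnosis. A real system would likely need richer signals than a fixed keyword list.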

Paper Structure

This paper contains 29 sections, 4 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the LogSage framework, consisting of an offline preparation phase for log template deduplication and knowledge base construction, and an online operational phase for RCA and solution generation with execution.
  • Figure 2: Prompt template for LogSage’s root cause analysis stage.
  • Figure 3: Prompt template for LogSage’s solution generation stage.
  • Figure 4: Token usage for RCA across methods and LLMs.
  • Figure 5: Query rounds for RCA across methods and LLMs.
  • ...and 3 more figures