What Happened in This Pipeline? Diffing Build Logs with CiDiff

Nicolas Hubner; Jean-Rémy Falleri; Raluca Uricaru; Thomas Degueule; Thomas Durieux

TL;DR

This paper tackles the challenge of diagnosing CI pipeline regressions by diffing failing and passing build logs. It introduces CiDiff, a log-diff tool that uses a log-line similarity metric and a genome-inspired seed-and-extend strategy to detect updated and moved lines, producing concise edit scripts with six action types. In a large-scale evaluation on 17,906 regression pairs against baselines such as LCS-diff, bigrams, and keywords, CiDiff produces substantially shorter diffs and achieves higher precision while maintaining strong recall, and participants prefer it in a majority of user-study cases. In practice, the work reduces the number of lines to inspect and delivers a usable open-source tool together with a rich dataset, which should ease adoption for CI failure debugging. Future work includes extending CiDiff to more use cases and integrating parsers or LLM-based similarity enhancements to further improve accuracy and coverage.

Abstract

Continuous integration (CI) is widely used by developers to ensure the quality and reliability of their software projects. However, diagnosing a CI regression is a tedious process that involves the manual analysis of lengthy build logs. In this paper, we explore how textual differencing can support the debugging of CI regressions. Because off-the-shelf diff algorithms produce suboptimal results, we introduce CiDiff, a new diff algorithm specifically tailored to build logs. We evaluate CiDiff against several baselines on a novel dataset of 17,906 CI regressions, performing an accuracy study, a quantitative study, and a user study. Notably, our algorithm reduces the number of lines to inspect by about 60% in the median case, with reasonable overhead compared to the state-of-practice LCS-diff. Finally, our algorithm is preferred by the majority of participants in 70% of the regression cases, whereas LCS-diff is preferred in only 5% of the cases.
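
To make the idea of similarity-aware log diffing concrete, here is a minimal Python sketch. It is not CiDiff's algorithm: it does not implement seed-and-extend, moved-line detection, or the six action types. It only shows why a plain line diff reports a line with a changed duration or counter as a deletion plus an addition, and how a line-similarity check can merge the two into a single "updated" action. The `token_similarity` metric, the `threshold` value, and the sample log lines are assumptions made for illustration. The baseline uses `difflib.SequenceMatcher` from the Python standard library, which is a line matcher similar in spirit to an LCS diff but not a strict LCS implementation.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Hypothetical metric: share of position-aligned whitespace tokens that are equal."""
    ta, tb = a.split(), b.split()
    if not ta or len(ta) != len(tb):
        return 0.0
    return sum(x == y for x, y in zip(ta, tb)) / len(ta)


def diff_logs(old, new, threshold=0.5):
    """Return (action, old_line, new_line) tuples; action is updated, deleted, or added."""
    actions = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical lines carry no information for the reader
        unmatched = list(range(j1, j2))  # indices of new lines not yet paired
        for line in old[i1:i2]:
            # Greedily pick the most similar new line within the same changed block.
            best = max(unmatched,
                       key=lambda j: token_similarity(line, new[j]),
                       default=None)
            if best is not None and token_similarity(line, new[best]) >= threshold:
                actions.append(("updated", line, new[best]))
                unmatched.remove(best)
            else:
                actions.append(("deleted", line, None))
        for j in unmatched:
            actions.append(("added", None, new[j]))
    return actions


# Made-up log excerpts: a passing run and a failing run of the same job.
passing = ["Step 1/3 done in 12 ms", "Compiling module core", "Tests passed: 40"]
failing = ["Step 1/3 done in 15 ms", "Compiling module core",
           "ERROR: test_login failed", "Tests passed: 39"]

for action in diff_logs(passing, failing):
    print(action)
```

With these sample logs, the only line reported as purely added is the error message, while the timing and counter lines are grouped as updates that a reader can skim past. A greedy best match within each changed block keeps the sketch short; a real tool would need a more careful matching strategy.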
