What Happened in This Pipeline? Diffing Build Logs with CiDiff
Nicolas Hubner, Jean-Rémy Falleri, Raluca Uricaru, Thomas Degueule, Thomas Durieux
TL;DR
This paper tackles the challenge of diagnosing CI pipeline regressions by diffing failing build logs against passing ones. It introduces CiDiff, a log-diff tool that uses a log-line similarity metric and a genome-inspired seed-and-extend strategy to detect updated and moved lines, producing concise edit scripts with six action types. In a large-scale evaluation on 17,906 regression pairs against baselines such as LCS-diff, bigrams, and keywords, CiDiff yields substantially shorter diffs and higher precision while maintaining strong recall, and it is preferred by participants in a majority of user-study cases. The work reduces the number of lines a developer must inspect and delivers a usable open-source tool along with a rich dataset for CI failure debugging. Future work includes extending CiDiff to more use cases and integrating parsers or LLM-based similarity measures to further improve accuracy and coverage.
Abstract
Continuous integration (CI) is widely used by developers to ensure the quality and reliability of their software projects. However, diagnosing a CI regression is a tedious process that involves the manual analysis of lengthy build logs. In this paper, we explore how textual differencing can support the debugging of CI regressions. As off-the-shelf diff algorithms produce suboptimal results, we introduce CiDiff, a new diff algorithm specifically tailored to build logs. We evaluate CiDiff against several baselines on a novel dataset of 17 906 CI regressions, performing an accuracy study, a quantitative study, and a user study. Notably, our algorithm reduces the number of lines to inspect by about 60 % in the median case, with reasonable overhead compared to the state-of-practice LCS-diff. Finally, our algorithm is preferred by the majority of participants in 70 % of the regression cases, whereas LCS-diff is preferred in only 5 % of the cases.
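To make the motivation concrete, the following is a minimal Python sketch of why exact line matching struggles on build logs and how a line-similarity test can help. It is not CiDiff's algorithm: the positional token-overlap metric, the 0.5 threshold, the positional pairing of lines, and the names `token_similarity` and `diff_logs` are all assumptions made for illustration, and the log lines are invented. The sketch uses Python's standard `difflib.SequenceMatcher` for the underlying line matching and does not attempt moved-line detection or the seed-and-extend strategy.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Hypothetical metric: share of whitespace tokens equal at the same position.

    Returns 0.0 when the two lines have different token counts or are empty.
    """
    ta, tb = a.split(), b.split()
    if not ta or len(ta) != len(tb):
        return 0.0
    return sum(x == y for x, y in zip(ta, tb)) / len(ta)


def diff_logs(passing, failing, threshold=0.5):
    """Return (action, old_line, new_line) tuples; unchanged lines are skipped.

    Lines inside a changed block are paired by position (an assumption of this
    sketch) and reported as "updated" when they are similar enough.
    """
    script = []
    matcher = SequenceMatcher(a=passing, b=failing, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = passing[i1:i2], failing[j1:j2]
        paired = min(len(old), len(new))
        for k in range(paired):
            if token_similarity(old[k], new[k]) >= threshold:
                script.append(("updated", old[k], new[k]))
            else:
                script.append(("deleted", old[k], None))
                script.append(("added", None, new[k]))
        # Whatever is left over on either side has no counterpart.
        script.extend(("deleted", line, None) for line in old[paired:])
        script.extend(("added", None, line) for line in new[paired:])
    return script


# Invented example logs: every line differs because of its timestamp.
passing = [
    "12:00:01 Compiling module core",
    "12:00:05 Running 42 tests",
    "12:00:09 BUILD SUCCESS",
]
failing = [
    "13:30:02 Compiling module core",
    "13:30:07 Running 42 tests",
    "13:30:08 ERROR NullPointerException in FooTest",
    "13:30:09 BUILD FAILURE",
]

script = diff_logs(passing, failing)
for action, old_line, new_line in script:
    print(action, "|", old_line, "|", new_line)
```

In this example an exact line diff would report all seven lines as changed, because no line is byte-identical across the two logs. With the similarity test, the two lines that differ only in their timestamp are classified as updates, leaving the error line and the build status lines as the additions and deletions to inspect.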
