"Good" and "Bad" Failures in Industrial CI/CD -- Balancing Cost and Quality Assurance
Simin Sun, David Friberg, Miroslaw Staron
TL;DR
The paper examines how industrial teams balance cost and quality when structuring and optimizing CI/CD pipelines. It reports a year-long qualitative study with eight practitioners across four companies, characterizing their CI/CD architectures and classifying jobs into a taxonomy anchored around two milestones: code merge and product release. Key findings are that multiple CI/CD tools commonly coexist, that Jenkins remains dominant alongside GitHub Actions, and that pre-merge optimization deserves priority because failures are more frequent and the workload is larger before merge. The study highlights the need for pre-merge tooling, predictive mechanisms, and a universal framework to reduce wasted effort while guarding against late-stage failures. Future work will explore predictive models and LLM-based support.
Abstract
Continuous Integration and Continuous Deployment (CI/CD) pipelines automate software development to speed up software engineering and make it more efficient. These workflows consist of various jobs, such as code validation and testing, which must complete before developers receive feedback. Jobs can fail, which leads to unnecessary delays in build times, decreased developer productivity, and increased costs for companies. To explore how companies adopt CI/CD workflows and balance cost with quality assurance during optimization, we studied four companies and report their industry experiences with CI/CD practices. Our findings reveal that organizations can confuse the distinction between CI and CD, whereas code merge and product release serve as more effective milestones for process optimization and risk control. While numerous tools and research efforts target the post-merge phase to enhance productivity, limited attention has been given to the pre-merge phase, where early failure prevention has greater impact and carries less risk.
