CI at Scale: Lean, Green, and Fast
Dhruva Juloori, Zhongpeng Lin, Matthew Williams, Eddy Shin, Sonal Mahajan
TL;DR
The paper addresses how to land changes quickly while keeping the mainline green in Uber's large monorepos. It extends SubmitQueue with an enhanced probabilistic model, NGBoost-based build-time predictions, BLRD, and a speculation threshold, which together make speculative execution and conflict-aware scheduling more efficient. In production across the Go, iOS, and Android monorepos, the approach reduces CI resource usage by approximately 53%, CPU usage by 44%, and P95 waiting times by 37%, without compromising mainline integrity. The work combines systems engineering with data-driven scheduling to make CI scalable and cost-effective for large, high-velocity codebases.
Abstract
Maintaining a "green" mainline branch, where all builds pass successfully, is crucial but challenging in fast-paced, large-scale software development environments, particularly with concurrent code changes in large monorepos. SubmitQueue, a system designed to address these challenges, speculatively executes builds and lands only the changes whose builds succeed. Despite its effectiveness, the system uses resources inefficiently: a high proportion of builds are aborted prematurely, and smaller changes are delayed when they are blocked by larger conflicting ones. This paper introduces enhancements to SubmitQueue that optimize resource usage and improve build prioritization. Central to these enhancements is a probabilistic model that distinguishes between changes with shorter and longer build times and uses that distinction to prioritize builds. By using a machine learning model to predict build times and incorporating those predictions into the probabilistic framework, we expedite the landing of smaller changes that are blocked by larger, more time-consuming conflicting changes. Additionally, a speculation threshold ensures that only the most likely builds are executed, reducing unnecessary resource consumption. After implementing these enhancements across Uber's major monorepos (Go, iOS, and Android), we observed a reduction in Continuous Integration (CI) resource usage of approximately 53%, in CPU usage of 44%, and in P95 waiting times of 37%. These improvements show that SubmitQueue can manage large-scale software changes more efficiently while maintaining a green mainline.
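To make the speculation-threshold idea concrete, the following is a minimal sketch, not the SubmitQueue implementation. It assumes each speculative build carries an estimated probability of being needed and a predicted build time. The names `SpeculativeBuild`, `select_builds`, `p_needed`, and `predicted_minutes`, as well as the ranking score (probability divided by predicted minutes), are hypothetical choices made for illustration; the abstract does not specify how the probability and the predicted time are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeculativeBuild:
    """A candidate build (hypothetical structure, for illustration only)."""
    name: str
    p_needed: float           # estimated probability that this build will be needed
    predicted_minutes: float  # predicted build time, e.g. from an ML model


def select_builds(candidates, threshold, max_workers):
    """Pick builds to run: drop unlikely ones, then favor likely and short ones.

    The score p_needed / predicted_minutes is an assumed heuristic, not the
    scoring rule from the paper.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    if any(b.predicted_minutes <= 0 for b in candidates):
        raise ValueError("predicted_minutes must be positive")

    # Speculation threshold: only builds that are likely enough are eligible.
    eligible = [b for b in candidates if b.p_needed >= threshold]

    # Rank eligible builds so that likely, short builds come first.
    eligible.sort(key=lambda b: b.p_needed / b.predicted_minutes, reverse=True)
    return eligible[:max_workers]


candidates = [
    SpeculativeBuild("A", p_needed=0.9, predicted_minutes=40.0),
    SpeculativeBuild("B", p_needed=0.8, predicted_minutes=5.0),
    SpeculativeBuild("C", p_needed=0.3, predicted_minutes=2.0),
    SpeculativeBuild("D", p_needed=0.6, predicted_minutes=10.0),
]

selected = select_builds(candidates, threshold=0.5, max_workers=3)
print([b.name for b in selected])
```

In this toy input, build C is never scheduled because its probability falls below the threshold, even though it is the shortest build. Among the remaining builds, the short build B is ranked ahead of the long build A, which mirrors the goal of not letting small changes wait behind large ones.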
