
Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem

Tao Xiao, Dong Wang, Shane McIntosh, Hideaki Hata, Yasutaka Kamei

TL;DR

This paper investigates ecosystem-wide test flakiness in the OpenStack ecosystem, focusing on cross-project flakiness (tests flaky across multiple projects) and inconsistent flakiness (tests flaky in some projects but stable in others). Using a data-driven case study of 649 OpenStack projects over one year, the authors quantify prevalence (about 55% of projects affected) and examine how test scope influences propagation. Unit tests are heavily implicated in cross-project flakiness (about 70% show cross-project effects), while API and scenario tests also contribute. The authors combine quantitative analysis of flaky builds and tests with qualitative coding to identify root causes of inconsistency (predominantly event-related CI race conditions, plus dependency and configuration issues). Based on their findings, they propose practical mitigations (standardizing CI configurations, automating dependency validation, improving test isolation, and establishing centralized cross-project tracking) and call for stronger ecosystem-level coordination to improve CI reliability and reduce wasted resources.

Abstract

Automated regression testing is a cornerstone of modern software development, often contributing directly to code review and Continuous Integration (CI). Yet some tests suffer from flakiness, where their outcomes vary non-deterministically. Flakiness erodes developer trust in test results, wastes computational resources, and undermines CI reliability. While prior research has examined test flakiness within individual projects, its broader ecosystem-wide impact remains largely unexplored. In this paper, we present an empirical study of test flakiness in the OpenStack ecosystem, which focuses on (1) cross-project flakiness, where flaky tests impact multiple projects, and (2) inconsistent flakiness, where a test exhibits flakiness in some projects but remains stable in others. By analyzing 649 OpenStack projects, we identify 1,535 cross-project flaky tests and 1,105 inconsistently flaky tests. We find that cross-project flakiness affects 55% of OpenStack projects and significantly increases both review time and computational costs. Surprisingly, 70% of unit tests exhibit cross-project flakiness, challenging the assumption that unit tests are inherently insulated from issues that span modules like integration and system-level tests. Through qualitative analysis, we observe that race conditions in CI, inconsistent build configurations, and dependency mismatches are the primary causes of inconsistent flakiness. These findings underline the need for better coordination across complex ecosystems, standardized CI configurations, and improved test isolation strategies.
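
To make the two definitions concrete, the following is a minimal sketch in Python and not the authors' tooling. It assumes that per-project flakiness has already been determined and is available as hypothetical (test_id, project, is_flaky) records; the function name and record format are illustrative only. A test is labeled cross-project flaky if it is flaky in at least two projects, and inconsistently flaky if it is flaky in at least one project while stable in at least one other. The thresholds are one reading of the definitions above.

```python
from collections import defaultdict


def classify_flakiness(observations):
    """Classify tests from (test_id, project, is_flaky) records.

    The record format is a hypothetical input, assumed for illustration.
    Returns (cross_project, inconsistent) as two sets of test ids.
    """
    flaky_in = defaultdict(set)   # test_id -> projects where it was flaky
    stable_in = defaultdict(set)  # test_id -> projects where it was stable
    for test_id, project, is_flaky in observations:
        (flaky_in if is_flaky else stable_in)[test_id].add(project)

    cross_project, inconsistent = set(), set()
    for test_id, flaky_projects in flaky_in.items():
        # Cross-project: flaky in two or more projects.
        if len(flaky_projects) >= 2:
            cross_project.add(test_id)
        # Inconsistent: stable in some project where it was never flaky.
        if stable_in.get(test_id, set()) - flaky_projects:
            inconsistent.add(test_id)
    return cross_project, inconsistent


# Made-up records, not data from the study.
observations = [
    ("test_a", "cinder", True),
    ("test_a", "glance", True),
    ("test_b", "nova", True),
    ("test_b", "neutron", False),
    ("test_c", "nova", False),
]
cross_project, inconsistent = classify_flakiness(observations)
```

Under these assumptions the two labels are not mutually exclusive: a test that is flaky in two projects and stable in a third would receive both.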

Paper Structure (20 sections, 2 equations, 5 figures, 5 tables)

This paper contains 20 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Real-world example of cross-project flakiness during code reviews #877934 (Cinder project) and #882133 (Glance project) in the OpenStack community.
  • Figure 2: Propagation of flakiness over time.
  • Figure 3: Parallel sets between test scopes and flakiness levels in OpenStack projects.
  • Figure 4: Failure rates of each Zuul job before, within, and after flaky range for inconsistent flakiness.
  • Figure 5: Time wasted (hours) of each Zuul job.