Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures
Owain Parry, Gregory Kapfhammer, Michael Hilton, Phil McMinn
TL;DR
The paper identifies systemic flakiness as the tendency of flaky tests to fail together in clusters due to shared root causes, challenging the view that flaky failures are isolated. Using a large, real-world dataset of 10,000 test-suite runs across 24 Java projects, the authors apply agglomerative clustering on failure co-occurrence and show that about 75% of flaky tests belong to one of 45 clusters, with a mean cluster size of 13.5 tests. They further demonstrate that machine learning models trained on static test-case distance measures can predict systemic flakiness with an average $R^2$ of $0.74$ (regression) and an average Matthews correlation coefficient (MCC) of $0.74$ (classification), with the hierarchy distance being the most influential feature. Manual analysis of clusters reveals intermittent networking issues and external dependency instabilities as the predominant causes. The findings have practical implications for reducing repair costs by addressing shared root causes and point to future work in automated detection, cross-language generalization, and automated root-cause analysis.
Abstract
Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We discovered that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This suggests that developers can reduce repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. It contains 810 flaky tests, which we leveraged to perform a mixed-method empirical analysis of co-occurring flaky test failures. Systemic flakiness is significant and widespread. We performed agglomerative clustering of flaky tests based on their failure co-occurrence, finding that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, we demonstrated a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness.
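To make the idea of clustering flaky tests by failure co-occurrence concrete, the following is a minimal sketch and not the authors' implementation. It assumes each flaky test is represented by the set of run IDs in which it failed, that dissimilarity is measured with Jaccard distance, and that clusters are merged with average linkage until a cut-off is reached. The distance measure, the linkage, the `threshold` value of 0.5, and all test names are illustrative assumptions that the abstract does not specify.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of failing run IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_flaky_tests(failures, threshold=0.5):
    """Average-linkage agglomerative clustering by failure co-occurrence.

    failures:  dict mapping a test name to the set of run IDs where it failed.
    threshold: hypothetical cut-off; merging stops once the closest pair of
               clusters is farther apart than this value.
    """
    clusters = [[name] for name in sorted(failures)]

    def linkage(c1, c2):
        # Mean pairwise Jaccard distance between members of the two clusters.
        dists = [jaccard_distance(failures[x], failures[y]) for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical failure data: three tests fail in nearly the same runs,
# two others fail in isolation.
failures = {
    "testFetch": {3, 17, 42},
    "testUpload": {3, 17, 42},
    "testRetry": {3, 17, 42, 58},
    "testParse": {7},
    "testSort": {91},
}

clusters = cluster_flaky_tests(failures)
# Clusters with at least two members are candidates for systemic flakiness.
systemic = [c for c in clusters if len(c) >= 2]
print(systemic)
```

In this toy input, the three tests whose failures coincide end up in a single cluster, while the two isolated tests remain singletons. A shared root cause, such as an intermittent network fault, would be a plausible explanation to investigate for the multi-test cluster.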
