Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures

Owain Parry, Gregory Kapfhammer, Michael Hilton, Phil McMinn

TL;DR

The paper identifies systemic flakiness as the tendency of flaky tests to fail together in clusters due to shared root causes, challenging the view that flaky failures are isolated. Using a large, real-world dataset of 10,000 test-suite runs across 24 Java projects, the authors apply agglomerative clustering on failure co-occurrence and show that about 75% of flaky tests participate in clusters with a mean size of 13.5, across 45 clusters. They further demonstrate that machine learning models trained on static test-case distance measures can predict systemic flakiness with an average $R^2$ of $0.74$ (regression) and $MCC$ of $0.74$ (classification), with the hierarchy distance being the most influential feature. Manual analysis of clusters reveals intermittent networking issues and external dependency instabilities as the predominant causes. The findings have practical implications for reducing repair costs by addressing shared root causes and point to future work in automated detection, cross-language generalization, and automated root-cause analysis.

Abstract

Flaky tests produce inconsistent outcomes without code changes, creating major challenges for software developers. An industrial case study reported that developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We discovered that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness. This suggests that developers can reduce repair costs by addressing shared root causes, enabling them to fix multiple flaky tests at once rather than tackling them individually. This study represents an inflection point by challenging the deep-seated assumption that flaky test failures are isolated occurrences. We used an established dataset of 10,000 test suite runs from 24 Java projects on GitHub, spanning domains from data orchestration to job scheduling. It contains 810 flaky tests, which we leveraged to perform a mixed-method empirical analysis of co-occurring flaky test failures. Systemic flakiness is significant and widespread. We performed agglomerative clustering of flaky tests based on their failure co-occurrence, finding that 75% of flaky tests across all projects belong to a cluster, with a mean cluster size of 13.5 flaky tests. Instead of requiring 10,000 test suite runs to identify systemic flakiness, we demonstrated a lightweight alternative by training machine learning models based on static test case distance measures. Through manual inspection of stack traces, conducted independently by four authors and resolved through negotiated agreement, we identified intermittent networking issues and instabilities in external dependencies as the predominant causes of systemic flakiness.
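
To make the clustering step concrete, the following is a minimal sketch of agglomerative clustering over failure co-occurrence, not the authors' implementation. It assumes each flaky test is represented by the set of run IDs in which it failed, uses Jaccard distance between those sets, and merges clusters with average linkage until the closest pair exceeds a distance threshold. The linkage choice, the threshold value, the function names, and the test names and run IDs are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of failing run IDs."""
    union = a | b
    if not union:
        return 0.0  # two tests that never failed are treated as identical
    return 1.0 - len(a & b) / len(union)


def cluster_by_cooccurrence(failures, threshold):
    """Average-linkage agglomerative clustering, cut at `threshold`.

    failures: dict mapping test name -> set of run IDs in which it failed.
    Returns a list of clusters, each a sorted list of test names.
    """
    clusters = [[name] for name in sorted(failures)]

    def linkage(c1, c2):
        # Average pairwise Jaccard distance between members of two clusters.
        total = sum(
            jaccard_distance(failures[x], failures[y]) for x in c1 for y in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage criterion.
        (i, j), dist = min(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical failure data: three tests that fail in (almost) the same runs,
# plus one unrelated test.
failures = {
    "testLogin": {3, 17, 42, 99},
    "testLogout": {3, 17, 42, 99},
    "testUpload": {3, 17, 42},
    "testParse": {500},
}
clusters = cluster_by_cooccurrence(failures, threshold=0.5)
print(clusters)
```

With this toy input, the three tests whose failures overlap heavily end up in one cluster and the unrelated test stays on its own. The quadratic search for the closest pair is kept for readability; a library implementation would be preferable for 810 tests.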

Paper Structure

This paper contains 17 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Two flaky tests from the apache-ambari project that form a Networking cluster. They both failed after calling the authenticate method during the exact same 8 test suite runs out of 10,000. In both cases, the root exception was java.net.ConnectException: Connection refused (Connection refused).
  • Figure 2: Dendrograms for three projects illustrating the hierarchy of the clusters prior to extracting a concrete clustering. The vertical axis shows the Jaccard distance at which clusters are merged (see the paper's Jaccard distance equation). The dotted line represents the distance threshold that produces the concrete clustering with the greatest mean silhouette score (see the paper's silhouette score equation).
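
The two equations that the Figure 2 caption refers to are not reproduced on this page. For orientation, the standard textbook definitions of the two quantities are given below. The notation is an assumption and may differ from the paper's: $F_i$ denotes the set of test suite runs in which flaky test $t_i$ failed, $a(t)$ is the mean distance from test $t$ to the other tests in its own cluster, and $b(t)$ is the smallest mean distance from $t$ to the tests of any other cluster.

$$
d_J(t_i, t_j) = 1 - \frac{|F_i \cap F_j|}{|F_i \cup F_j|}
$$

$$
s(t) = \frac{b(t) - a(t)}{\max\{a(t),\, b(t)\}}
$$

Under this definition, the two tests in Figure 1 failed in exactly the same 8 runs, so $F_i = F_j$ and $d_J = 1 - 8/8 = 0$, the smallest possible distance. The mean silhouette score of a clustering is the average of $s(t)$ over all clustered tests.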