230,439 Test Failures Later: An Empirical Evaluation of Flaky Failure Classifiers

Abdulrahman Alshammari; Paul Ammann; Michael Hilton; Jonathan Bell

TL;DR

This work addresses the challenge of distinguishing flaky test failures from true defects to reduce costly reruns. It builds a large, ground-truth dataset from 22 open-source Java projects by combining flaky failures with deterministically reproduced true failures obtained via mutation testing, and evaluates three failure de-duplication approaches: text-based matching, a Failure Log Classifier, and TF-IDF. The findings show that flaky failures are often highly repetitive, but the performance of de-duplication varies across projects; TF-IDF generally performs best, with machine-learning approaches offering mixed gains depending on project log richness. The study provides practical guidance on when de-duplication can reliably aid triage and releases open data and tooling to spur further research in flaky failure detection.

Abstract

Flaky tests are tests that can non-deterministically pass or fail, even in the absence of code changes. Despite being a source of false alarms, flaky tests often remain in test suites once they are detected, as they may also be relied upon to detect true failures. Hence, a key open problem in flaky test research is: how can we quickly determine whether a test failed due to flakiness or because it detected a bug? The state of the practice is for developers to re-run failing tests: if a test fails and then passes, it is flaky by definition; if the test persistently fails, it is likely a true failure. However, this approach can be both ineffective and inefficient. An alternative approach that developers may already use for triaging test failures is failure de-duplication, which matches newly discovered test failures to previously witnessed flaky and true failures. However, because flaky test failure symptoms might resemble those of true failures, there is a risk of misclassifying a true test failure as a flaky failure to be ignored. Using a dataset of 498 flaky tests from 22 open-source Java projects, we collect a large dataset of 230,439 failure messages (both flaky and not), allowing us to empirically investigate the efficacy of failure de-duplication. We find that for some projects, this approach is extremely effective (with 100% specificity), while for other projects, the approach is entirely ineffective. By analyzing the characteristics of these flaky and non-flaky failures, we provide useful guidance on how developers should rely on this approach.
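
To make the idea of failure de-duplication concrete, the following is a minimal sketch of how a new failure message might be matched against previously labeled failures using TF-IDF weights and cosine similarity. It is an illustration only, not the implementation evaluated in the paper: the tokenizer, the smoothed IDF formula, the function names, the 0.5 threshold, and the example messages are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(message):
    # Deliberately simple tokenizer (an assumption): lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9_]+", message.lower())


def build_idf(documents):
    # Smoothed inverse document frequency over the known failure messages.
    n = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Terms that never occurred in the known failures are ignored.
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_failure(new_message, known_failures, threshold=0.5):
    """Return (label, score) of the most similar known failure.

    known_failures is a list of (message, label) pairs, where label is
    "flaky" or "true". If nothing is similar enough, the label is "unknown".
    The threshold value is hypothetical and would need per-project tuning.
    """
    docs = [tokenize(message) for message, _ in known_failures]
    idf = build_idf(docs)
    query = tfidf_vector(tokenize(new_message), idf)

    best_label, best_score = "unknown", 0.0
    for tokens, (_, label) in zip(docs, known_failures):
        score = cosine(query, tfidf_vector(tokens, idf))
        if score > best_score:
            best_label, best_score = label, score

    if best_score < threshold:
        return "unknown", best_score
    return best_label, best_score


if __name__ == "__main__":
    # Made-up failure messages, for illustration only.
    known = [
        ("java.net.SocketTimeoutException: Read timed out at HttpClient.get", "flaky"),
        ("java.lang.AssertionError: expected:<3> but was:<4> at CalculatorTest.testAdd", "true"),
    ]
    print(classify_failure(
        "java.net.SocketTimeoutException: Read timed out at HttpClient.post", known))
```

Returning "unknown" when no known failure is similar enough reflects the risk noted above: a true failure that merely resembles a flaky one should not be silently ignored, so a conservative threshold that falls back to a rerun or manual triage is the safer default in this sketch.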
