The Effects of Computational Resources on Flaky Tests

Denini Silva, Martin Gruber, Satyajit Gokhale, Ellen Arteca, Alexi Turcotte, Marcelo d'Amorim, Wing Lam, Stefan Winter, Jonathan Bell

TL;DR

We hypothesize that, regardless of the underlying root cause of flakiness, it may be possible to reduce the rate of flaky failures by providing more computational resources to test-running infrastructure.

Abstract

Flaky tests are tests that nondeterministically pass and fail in unchanged code. These tests can be detrimental to developers' productivity. Particularly when tests run in continuous integration environments, the tests may be competing for access to limited computational resources (CPUs, memory, etc.), and we hypothesize that resource (in)availability may be a significant factor in the failure rate of flaky tests. We present the first assessment of the impact that computational resources have on flaky tests, covering a total of 52 projects written in Java, JavaScript, and Python, and 27 different resource configurations. Using a rigorous statistical methodology, we determine which tests are RAFT (Resource-Affected Flaky Tests). We find that 46.5% of the flaky tests in our dataset are RAFT, indicating that a substantial proportion of flaky-test failures can be avoided by adjusting the resources available when running tests. We reported RAFTs, along with configurations that avoid them, to developers, and received interest in either fixing the RAFTs or improving the specifications of the projects so that tests would be run only in configurations that are unlikely to encounter RAFT failures. Our results also have implications for researchers attempting to detect flaky tests, e.g., reducing the resources available when running tests is a cost-effective approach to detect more flaky failures.
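
The abstract does not spell out the statistical methodology, so the following is only an illustrative sketch and not the authors' procedure. It shows one plausible way to ask whether a test fails significantly more often under a throttled configuration than under the baseline: a one-sided Fisher's exact test on failure counts. The function names, the run counts, and the 0.05 threshold are all assumptions made for this example.

```python
from math import comb


def fisher_one_sided(fail_base, runs_base, fail_throttled, runs_throttled):
    """One-sided Fisher's exact test (hypothetical helper).

    Returns the probability of seeing at least `fail_throttled` failures
    in the throttled runs, given the total number of failures observed,
    if resources had no effect on the failure rate.
    """
    total_fail = fail_base + fail_throttled
    total_runs = runs_base + runs_throttled
    denom = comb(total_runs, total_fail)
    upper = min(total_fail, runs_throttled)
    # Sum the hypergeometric tail; math.comb returns 0 when k > n,
    # so impossible tables contribute nothing.
    tail = sum(
        comb(runs_throttled, k) * comb(runs_base, total_fail - k)
        for k in range(fail_throttled, upper + 1)
    )
    return tail / denom


def looks_resource_affected(fail_base, runs_base, fail_throttled,
                            runs_throttled, alpha=0.05):
    """Flag a test when throttling raises its failure count significantly.

    `alpha` is an assumed significance level, not a value from the paper.
    """
    p = fisher_one_sided(fail_base, runs_base, fail_throttled, runs_throttled)
    return p < alpha
```

For instance, a test that fails 1 time in 100 baseline runs and 30 times in 100 throttled runs would be flagged by this sketch, while a test that fails 5 times in 100 runs under both conditions would not. When many tests and configurations are compared, a correction for multiple comparisons would also be needed; that is omitted here.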

Paper Structure

This paper contains 35 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Trade-off between the likelihood of observing RAFT failures and resource availability.
  • Figure 2: Example RAFT from the TestGetFunction class in the delight-nashorn-sandbox project [GitHubDelightNashorn].
  • Figure 3: Just how resource affected are these flaky tests? For each project with flaky tests, we show the failure increase rate from the baseline "no throttling" configuration to the most failure-inducing resource throttling condition.
  • Figure 4: What are the best resource configurations to prevent flaky failures? For each configuration that we analyzed, we show the number of times that it was the best at avoiding flaky failures, the best in terms of price, or the best in terms of both. If a configuration was tied for best in terms of reliability for a project, we select the cheaper one. We hide configurations that were not optimal on either dimension.
  • Figure 5: What are the best resource configurations to detect flaky failures? For each configuration that we analyzed, we show the number of times that it was best at detecting flaky tests (number of unique flaky tests detected), the best in terms of price, or the best in terms of both. If a configuration was tied for best in terms of detection for a project, we select the cheaper one. We hide configurations that were not optimal on either dimension.
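
The selection rule described in the Figure 4 caption (pick the configuration that best avoids flaky failures, and break ties by choosing the cheaper one) can be sketched as follows. This is a minimal illustration under stated assumptions: the `ConfigResult` type, the configuration names, the failure counts, and the prices are all hypothetical and do not come from the paper's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigResult:
    name: str        # hypothetical configuration label
    failures: int    # flaky failures observed for one project
    price: float     # assumed cost of running under this configuration


def best_for_reliability(results):
    """Fewest failures wins; ties go to the cheaper configuration."""
    return min(results, key=lambda r: (r.failures, r.price))


def best_for_price(results):
    """Cheapest configuration, regardless of failures."""
    return min(results, key=lambda r: r.price)


# Made-up results for a single project.
results = [
    ConfigResult("cpu1_mem2", failures=12, price=1.0),
    ConfigResult("cpu2_mem4", failures=0, price=2.5),
    ConfigResult("cpu4_mem8", failures=0, price=5.0),
]

reliable = best_for_reliability(results)
cheap = best_for_price(results)
best_on_both = reliable.name == cheap.name
```

Here `cpu2_mem4` and `cpu4_mem8` tie on failures, so the cheaper `cpu2_mem4` is chosen for reliability, while `cpu1_mem2` is the cheapest overall. The rule in the Figure 5 caption has the same shape: rank by the number of unique flaky tests detected (more is better) and again break ties by price.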