
A Generic Approach to Fix Test Flakiness in Real-World Projects

Yang Chen, Reyhaneh Jabbarvand

TL;DR

The paper tackles test flakiness by presenting FlakyDoctor, a neuro-symbolic method that combines LLM-based generalization with sound program analysis to repair both OD and ID flaky tests. It introduces a four-component architecture (Inspector, Prompt Generator, Tailor, and Validator) with a feedback loop that iteratively refines patches. Evaluation on 873 flaky tests from 243 real-world projects shows repair rates of 57% for OD and 59% for ID tests, and demonstrates that combining neural and symbolic techniques yields repairs that prior symbolic approaches alone could not reach. The work highlights the importance of precise bug localization, targeted prompt crafting, and stitching to manage compilation issues. Its practical results include repairs for 79 previously unfixed flaky tests and 19 pull requests merged at the time of submission. This approach broadens the applicability of automated flaky test repair and suggests a productive direction for integrating LLMs with static analysis in software maintenance.
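The feedback loop described above can be pictured with a minimal Python sketch. This is an illustration only, not FlakyDoctor's implementation: the function `repair_loop`, the `Outcome` type, the callable parameters, and the `max_rounds` limit are all hypothetical names chosen here, and the mapping of each callable to a component (Inspector for localization, Prompt Generator for prompt crafting, Tailor for stitching, Validator for checking) is an assumption inferred from the summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    """Hypothetical validation result: pass/fail plus feedback for the next round."""
    passed: bool
    feedback: str = ""


def repair_loop(
    test_code: str,
    inspect: Callable[[str], str],            # assumed Inspector role: localize the problem
    build_prompt: Callable[[str, str], str],  # assumed Prompt Generator role
    ask_model: Callable[[str], str],          # stand-in for an LLM call
    stitch: Callable[[str, str], str],        # assumed Tailor role: merge patch into the test
    validate: Callable[[str], Outcome],       # assumed Validator role: compile and run
    max_rounds: int = 3,                      # hypothetical iteration budget
) -> Optional[str]:
    """Return the first candidate that validates, or None if the budget runs out."""
    context = inspect(test_code)
    feedback = ""
    for _ in range(max_rounds):
        prompt = build_prompt(context, feedback)
        candidate = stitch(test_code, ask_model(prompt))
        outcome = validate(candidate)
        if outcome.passed:
            return candidate
        # Feed the failure (e.g. a compiler or test error message) into the next prompt.
        feedback = outcome.feedback
    return None
```

Passing the components in as plain callables keeps the sketch independent of any particular LLM client or build tool; in a real setting each would wrap substantially more logic.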

Abstract

Test flakiness, a non-deterministic behavior of builds unrelated to code changes, is a major and continuing impediment to delivering reliable software. The very few techniques for the automated repair of test flakiness are specifically crafted to repair either Order-Dependent (OD) or Implementation-Dependent (ID) flakiness. They are also all symbolic approaches, i.e., they leverage program analysis to detect and repair known test flakiness patterns and root causes, and so fail to generalize. To bridge the gap, we propose FlakyDoctor, a neuro-symbolic technique that combines the power of LLMs (generalizability) and program analysis (soundness) to fix different types of test flakiness. Our extensive evaluation using 873 confirmed flaky tests (332 OD and 541 ID) from 243 real-world projects demonstrates the ability of FlakyDoctor to repair flakiness, achieving success rates of 57% (OD) and 59% (ID). Compared with three alternative flakiness repair approaches, FlakyDoctor can repair 8% more ID tests than DexFix, 12% more OD flaky tests than ODRepair, and 17% more OD flaky tests than iFixFlakies. Regardless of the underlying LLM, the non-LLM components of FlakyDoctor contribute 12-31% of the overall performance, i.e., while part of FlakyDoctor's power comes from using LLMs, they are not good enough to repair flaky tests in real-world projects alone. What makes the proposed technique superior to related research on test flakiness mitigation specifically, and program repair in general, is that it repairs 79 previously unfixed flaky tests in real-world projects. We opened pull requests for all cases with corresponding patches; 19 of them were accepted and merged at the time of submission.

Paper Structure (23 sections, 6 figures, 4 tables, 1 algorithm)

Figures (6)

  • Figure 1: Example of a previously unfixed OD flakiness in Elasticjob shardingsphere-elasticjob repaired by FlakyDoctor that cannot be repaired by alternative approaches
  • Figure 2: Example of a previously unfixed ID flakiness in Hadoop hadoop repaired by FlakyDoctor that cannot be repaired by alternative approaches
  • Figure 3: Overview of FlakyDoctor for repairing test flakiness
  • Figure 4: Prompt templates for repairing OD-Victim (a) and ID (b) test flakiness
  • Figure 5: Comparison between the correct patches generated by different approaches. Sub-figures a-b compare OD-Victim, c-d compare OD-Brittle, and e-f compare ID patches
  • ...and 1 more figures