UTFix: Change Aware Unit Test Repairing using LLM

Shanto Rahman, Sachit Kuhar, Berk Cirisci, Pranav Garg, Shiqi Wang, Xiaofei Ma, Anoop Deoras, Baishakhi Ray

TL;DR

This work addresses the problem of tests becoming outdated when focal methods evolve. It presents UTFix, a change-aware, LLM-guided approach that repairs unit tests in two failure modes: assertion failures and reduced code coverage, using static and dynamic code slices to inform prompt generation. The authors introduce Tool-Bench and a real-world benchmark to evaluate UTFix on Python projects, showing substantial repair rates for assertion failures (up to 89.2% in synthetic data) and strong improvements in code coverage (e.g., many tests reaching 100%). They demonstrate that integrating both static and dynamic slices with iterative failure-feedback prompts enhances repair effectiveness, and they release benchmarks and tooling to support future research. The findings highlight the practical potential of LLM-based test repair in evolving software and point to useful directions for broader language support and improved prompting strategies.

Abstract

Software updates, including bug fixes and feature additions, are frequent in modern applications, but they often leave test suites outdated, resulting in undetected bugs and increased chances of system failures. A recent study by Meta revealed that 14%-22% of software failures stem from outdated tests that fail to reflect changes in the codebase. This highlights the need to keep tests in sync with code changes to ensure software reliability. In this paper, we present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes. UTFix addresses two critical issues: assertion failures and reduced code coverage caused by changes in the focal method. Our approach leverages language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages. We evaluate UTFix on our generated synthetic benchmark (Tool-Bench) and on real-world benchmarks. Tool-Bench includes diverse changes from popular open-source Python GitHub projects, where UTFix successfully repaired 89.2% of assertion failures and achieved 100% code coverage for 96 out of 369 tests. On the real-world benchmarks, UTFix repairs 60% of assertion failures while achieving 100% code coverage for 19 out of 30 unit tests. To the best of our knowledge, this is the first comprehensive study focused on unit test repair in evolving Python projects. Our contributions include the development of UTFix, the creation of Tool-Bench and real-world benchmarks, and the demonstration of the effectiveness of LLM-based methods in addressing unit test failures due to software evolution.

Paper Structure

This paper contains 46 sections, 2 equations, 16 figures.

Figures (16)

  • Figure 1: Assertion failure example
  • Figure 2: Code coverage example
  • Figure 3: UTFix Framework
  • Figure 4: Example of an assertion failure, highlighting the lines taken as feedback. Only the failure-point lines starting with '>' and the lines starting with 'E' (the error tag) are considered to pinpoint the issue.
  • Figure 5: An example of static slice
  • ...and 11 more figures
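
The Figure 4 caption describes keeping only the lines of a failure report that start with '>' or 'E'. A minimal sketch of that filtering step is shown below, assuming a pytest-style report. The function name `extract_failure_feedback` and the sample report are hypothetical illustrations, not taken from the UTFix implementation.

```python
# Hypothetical sketch: filter a pytest-style failure report down to the
# lines used as feedback (failure point '>' and error detail 'E').
# This is an illustration of the idea in the Figure 4 caption, not UTFix code.

def extract_failure_feedback(report: str) -> list[str]:
    """Return report lines that start with '>' or with an 'E' tag."""
    feedback = []
    for line in report.splitlines():
        # '>' marks the statement that failed; 'E' followed by whitespace
        # marks an error detail line. A bare 'E' line is kept as well.
        if line.startswith(">") or line.startswith("E ") or line == "E":
            feedback.append(line.rstrip())
    return feedback


# Made-up sample report, built line by line to keep whitespace explicit.
SAMPLE_REPORT = "\n".join([
    "    def test_add():",
    ">       assert add(2, 3) == 6",
    "E       assert 5 == 6",
    "E        +  where 5 = add(2, 3)",
    "",
    "test_calc.py:5: AssertionError",
])

if __name__ == "__main__":
    for line in extract_failure_feedback(SAMPLE_REPORT):
        print(line)
```

Filtering this way keeps the feedback short, so a repair prompt could carry the failing statement and the observed values without the surrounding traceback. How UTFix assembles its prompts beyond this is not specified in this section.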