TestForge: Feedback-Driven, Agentic Test Suite Generation

Kush Jain, Claire Le Goues

TL;DR

TestForge introduces an agentic, feedback-driven framework for automated unit-test generation that iteratively refines a zero-shot test suite using execution and coverage feedback at the file level. By operating within OpenHands and using a cost-aware loop, TestForge achieves state-of-the-art metrics on the TestGenEval benchmark (pass@1 around 84%, line coverage ~44%, mutation score ~34%), while maintaining low cost (~$0.63 per file). The approach outperforms classical genetic-programming baselines and one-shot LLM baselines, and it yields more readable and maintainable tests than prior methods. The work demonstrates how dynamic feedback and planning can scale high-quality test generation to large, real-world codebases and provides reproducible benchmarks through OpenHands integration.

Abstract

Automated test generation holds great promise for alleviating the burdens of manual test creation. However, existing search-based techniques compromise on test readability, while LLM-based approaches are prohibitively expensive in practice. We present TestForge, an agentic unit testing framework designed to cost-effectively generate high-quality test suites for real-world code. Our key insight is to reframe LLM-based test generation as an iterative process. TestForge thus begins with tests generated via zero-shot prompting, and then continuously refines those tests based on feedback from test executions and coverage reports. We evaluate TestForge on TestGenEval, a real-world unit test generation benchmark sourced from 11 large-scale open-source repositories; we show that TestForge achieves a pass@1 rate of 84.3%, 44.4% line coverage, and 33.8% mutation score on average, outperforming prior classical approaches and a one-iteration LLM-based baseline. TestForge produces more natural and understandable tests than state-of-the-art search-based techniques, and offers substantial cost savings over LLM-based techniques (at $0.63 per file). Finally, we release a version of TestGenEval integrated with the OpenHands platform, a popular open-source framework featuring a diverse set of software engineering agents and agentic benchmarks, for future extension and development.
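The iterative process described in the abstract (start from a zero-shot suite, then refine it using execution and coverage feedback) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Feedback`, `refine_loop`, `toy_run`, `toy_refine`, and the `max_iterations` cap are hypothetical names introduced here, and the toy stand-ins below replace the LLM agent and the real test runner so that the loop can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Feedback:
    failing: List[str]  # names of tests that failed when executed
    covered: Set[int]   # line numbers of the file under test hit by passing tests


def refine_loop(
    initial_suite: List[str],
    run_suite: Callable[[List[str]], Feedback],
    refine: Callable[[List[str], Feedback], List[str]],
    total_lines: int,
    max_iterations: int = 25,  # hypothetical iteration budget
) -> List[str]:
    """Refine a test suite until it passes and covers every line, or the budget runs out."""
    suite = list(initial_suite)
    for _ in range(max_iterations):
        feedback = run_suite(suite)
        # Stop early once nothing fails and every line is covered.
        if not feedback.failing and len(feedback.covered) == total_lines:
            break
        suite = refine(suite, feedback)
    return suite


# Toy stand-ins (assumptions): each "test" maps to (passes, lines it covers).
TOY_TESTS = {
    "test_round_ok": (True, {1, 2}),
    "test_round_buggy": (False, {3}),
    "test_round_fixed": (True, {3}),
    "test_timing": (True, {4}),
}


def toy_run(suite: List[str]) -> Feedback:
    """Pretend to execute the suite and collect failures and coverage."""
    failing = [t for t in suite if not TOY_TESTS[t][0]]
    covered: Set[int] = set()
    for t in suite:
        if TOY_TESTS[t][0]:
            covered |= TOY_TESTS[t][1]
    return Feedback(failing, covered)


def toy_refine(suite: List[str], feedback: Feedback) -> List[str]:
    """Stand-in for the agent: repair failing tests first, then add coverage."""
    if feedback.failing:
        return ["test_round_fixed" if t in feedback.failing else t for t in suite]
    candidates = [
        name
        for name, (passes, lines) in TOY_TESTS.items()
        if passes and name not in suite and not lines <= feedback.covered
    ]
    return suite + candidates[:1]


final = refine_loop(
    ["test_round_ok", "test_round_buggy"], toy_run, toy_refine, total_lines=4
)
print(final)
```

In this toy run the loop first swaps the failing test for a passing one and then adds a test for the line that was still uncovered, which is the same repair-then-extend pattern the abstract describes. A real agent would replace `toy_refine` with model calls and `toy_run` with an actual test and coverage run.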

Paper Structure

This paper contains 33 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Example of tests generated by different approaches for timing and rounding functionality in pydata/xarray. GPT-4o generates a buggy test, which TestForge fixes while also adding coverage-improving tests.
  • Figure 2: Overview of TestForge. We start by generating a zero-shot test suite and then allow our agent to interact with the repository and the generated test suite. The agent can search code, view code, and write and edit files, and it can use the environment to run commands and tests. The output is a full test suite for the code file under test.
  • Figure 3: Code and test lengths across TestGenEval, HumanEvalFix, CAT-LM, and TestEval. Code and test files in TestGenEval are significantly longer than those in other benchmarks (even on the log scale). TestEval is not included in the test lengths plot, as it does not contain "gold" tests.
  • Figure 4: Coverage and pass@1 versus the number of iterations. Both metrics show diminishing returns as k increases, indicating that k=25 is a good value.
  • Figure 5: Frequency of TestForge commands taken while generating tests for all 1210 programs in TestGenEval. The most common actions are editing and executing the generated test suite, indicating an iterative approach to test suite refinement.
  • ...and 1 more figure