Table of Contents
Fetching ...

PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation

Wannita Takerngsaksiri, Rujikorn Charakorn, Chakkrit Tantithamthavorn, Yuan-Fang Li

TL;DR

This work introduces PyTester, a Deep Reinforcement Learning framework for Text-to-Testcase generation that creates syntactically correct, executable, complete, and effective Python tests from natural language requirements. By formulating the task as an RL problem and optimizing with Proximal Policy Optimization, PyTester integrates domain knowledge through a reward function that combines syntax correctness, test executability, and code coverage, yielding superior performance on the APPS benchmark compared to several large LLMs. The approach demonstrates that a small language model can outperform significantly larger models when guided by task-specific feedback, achieving $99\%$ syntax correctness, $84\%$ passing rate, $80\%$ code coverage, and $61\%$ mutation score, while offering substantially faster inference. The results highlight the value of incorporating SE domain knowledge into RL architectures and suggest broader potential for resource-efficient text-to-testcase generation in real-world TDD workflows.

Abstract

Test-driven development (TDD) is a widely-employed software development practice that mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues associated with TDD, automated test case generation approaches have recently been investigated. Such approaches take source code as input, but not the requirements. Therefore, existing work does not fully support true TDD, as actual code is required to generate test cases. In addition, current deep learning-based test case generation approaches are trained with one learning objective, i.e., to generate test cases that are exactly matched with the ground-truth test cases. However, such approaches may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that can automatically generate syntactically correct, executable, complete, and effective test cases while being aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our Deep RL approach enables PyTester, a small language model, to outperform much larger language models like GPT3.5, StarCoder, and InCoder. Our findings suggest that future research could consider improving small over large LMs for better resource efficiency by integrating the SE domain knowledge into the design of reinforcement learning architecture.

PyTester: Deep Reinforcement Learning for Text-to-Testcase Generation

TL;DR

This work introduces PyTester, a Deep Reinforcement Learning framework for Text-to-Testcase generation that creates syntactically correct, executable, complete, and effective Python tests from natural language requirements. By formulating the task as an RL problem and optimizing with Proximal Policy Optimization, PyTester integrates domain knowledge through a reward function that combines syntax correctness, test executability, and code coverage, yielding superior performance on the APPS benchmark compared to several large LLMs. The approach demonstrates that a small language model can outperform significantly larger models when guided by task-specific feedback, achieving syntax correctness, passing rate, code coverage, and mutation score, while offering substantially faster inference. The results highlight the value of incorporating SE domain knowledge into RL architectures and suggest broader potential for resource-efficient text-to-testcase generation in real-world TDD workflows.

Abstract

Test-driven development (TDD) is a widely-employed software development practice that mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues associated with TDD, automated test case generation approaches have recently been investigated. Such approaches take source code as input, but not the requirements. Therefore, existing work does not fully support true TDD, as actual code is required to generate test cases. In addition, current deep learning-based test case generation approaches are trained with one learning objective, i.e., to generate test cases that are exactly matched with the ground-truth test cases. However, such approaches may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that can automatically generate syntactically correct, executable, complete, and effective test cases while being aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our Deep RL approach enables PyTester, a small language model, to outperform much larger language models like GPT3.5, StarCoder, and InCoder. Our findings suggest that future research could consider improving small over large LMs for better resource efficiency by integrating the SE domain knowledge into the design of reinforcement learning architecture.
Paper Structure (22 sections, 8 equations, 5 figures, 5 tables)

This paper contains 22 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An example of test cases generated by the OpenAI's GPT3.5 model based on a given description input. Syntax correctness is measured by an AST parser, while the test executability is measured by running the test cases against the ground truth code.
  • Figure 2: The formulation of Text-to-Testcase generation as a Reinforcement Learning problem.
  • Figure 3: An overview of our Deep Reinforcement Learning framework for the Text-to-Testcase Generation task (called PyTester). $||$ denotes a concatenation, $+$ denotes an addition, and $-$ denotes a subtraction.
  • Figure 4: An example of test cases generated by PyTester and other baseline approaches (depicted from Problem ID #569 of the testing dataset).
  • Figure 5: (RQ3) The common types of runtime errors that occur in the PyTester-generated test cases.