Can Large Language Models Write Good Property-Based Tests?

Vasudev Vikram, Caroline Lemieux, Joshua Sunshine, Rohan Padhye

TL;DR

The paper addresses barriers to adopting property-based testing (PBT) by using API documentation and large language models (LLMs) to synthesize PBTs automatically. It introduces Proptest-AI, which supports single-stage and two-stage prompting strategies, together with an evaluation methodology based on validity, soundness, and property coverage, the last of which is measured with property mutants. Across 40 Python API methods and three LLMs, GPT-4 with two-stage prompting yields a valid and sound PBT in 2.4 samples on average, with about 20.5% property coverage. The results suggest that part of PBT authoring can be automated, and the methodology offers a rigorous way to evaluate the quality of LLM-generated tests.

Abstract

Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for PBTs. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we investigate using modern LLMs to automatically synthesize PBTs using two prompting techniques. A key challenge is to rigorously evaluate the LLM-synthesized PBTs. We propose a methodology to do so considering several properties of the generated tests: (1) validity, (2) soundness, and (3) property coverage, a novel metric that measures the ability of the PBT to detect property violations through generation of property mutants. In our evaluation on 40 Python library API methods across three models (GPT-4, Gemini-1.5-Pro, Claude-3-Opus), we find that with the best model and prompting approach, a valid and sound PBT can be synthesized in 2.4 samples on average. We additionally find that our metric for determining soundness of a PBT is aligned with human judgment of property assertions, achieving a precision of 100% and recall of 97%. Finally, we evaluate the property coverage of LLMs across all API methods and find that the best model (GPT-4) is able to automatically synthesize correct PBTs for 21% of properties extractable from API documentation.

Paper Structure (26 sections, 1 equation, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Truncated NumPy documentation for the numpy.cumsum API method. The documentation includes descriptions of properties about the result shape/size and additional information about the last element of the result.
  • Figure 2: A GPT-4-generated property-based test for numpy.cumsum. The test first generates random integer arrays of size 1 to 20 and a random axis. Then, the API method under test, np.cumsum, is invoked on the randomly generated inputs. Finally, three properties are checked on the output array, all derived from information in the API documentation. All comments are also generated by GPT-4.
  • Figure 3: Example property-based tests in Hypothesis for the Python sorted function to sort lists. The test_sorted_separate function uses a separate generator, whereas the function test_sorted_combined combines the generator and testing logic into one function.
  • Figure 4: Two methods of generating property-based tests using an LLM. The first is a single-stage prompt with zero-shot CoT instructions to (1) explain a generation strategy, (2) list properties to test, and (3) generate a single property-based test. The second method instructs the LLM to extract a list of properties from the API docs and then continues the conversation, instructing the LLM to write a test for each property.
  • Figure 5: An example invalid datetime.timedelta PBT produced by GPT-4. The datetime.timedelta constructor call in the test raises an OverflowError when the absolute magnitude of days exceeds 1,000,000.
  • ...and 6 more figures
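
The kind of invalidity described for Figure 5 can be reproduced with a small standard-library sketch. This is not the GPT-4 test from the paper: the generators `unbounded_days` and `bounded_days`, the property `check_days_roundtrip`, and the helper `is_valid` are hypothetical names introduced here, and the bound on days is read from `datetime.timedelta.max` rather than hard-coded. The point is that a test is invalid when its input generator produces values the constructor rejects, so the test errors out before any property is checked.

```python
import datetime
import random

# Largest day count a timedelta can represent, taken from the standard library.
MAX_DAYS = datetime.timedelta.max.days

def unbounded_days(rng):
    """Hypothetical generator that ignores the constructor's limits."""
    return rng.randint(-10 * MAX_DAYS, 10 * MAX_DAYS)

def bounded_days(rng):
    """Hypothetical generator restricted to the representable range."""
    return rng.randint(-MAX_DAYS, MAX_DAYS)

def check_days_roundtrip(days):
    """Property: a timedelta built from whole days reports the same day count."""
    td = datetime.timedelta(days=days)  # raises OverflowError if out of range
    assert td.days == days

def is_valid(gen, trials=500, seed=0):
    """True if the test never hits OverflowError on generated inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        try:
            check_days_roundtrip(gen(rng))
        except OverflowError:
            return False
    return True

if __name__ == "__main__":
    print("unbounded generator valid:", is_valid(unbounded_days))
    print("bounded generator valid:", is_valid(bounded_days))
```

Constraining the generator to the documented input range is one way to turn such a test into a valid one; catching the exception inside the test would be another, at the cost of exercising fewer meaningful inputs.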