Measuring the Influence of Incorrect Code on Test Generation

Dong Huang, Jie M. Zhang, Mark Harman, Mingzhe Du, Heming Cui

TL;DR

This work empirically quantifies how the correctness of the code under test in prompts affects Large Language Model–generated tests across multiple open-source and closed-source models and datasets. Using five prompt variants and evaluation on HumanEval, MBPP, APPS, BugsInPy, and SWE-Bench, the authors demonstrate that including correct code with a task description yields higher test accuracy, coverage, and bug-detection rates than prompts with incorrect code or task descriptions alone. Real-world code evaluations corroborate these trends, revealing a substantial drop in bug detection when prompts use incorrect code. The findings offer actionable guidance for practitioners aiming to rely on LLMs for automated testing and highlight areas for future work to improve resilience against incorrect code in real-world software development.

Abstract

It is natural to suppose that a Large Language Model is more likely to generate correct test cases when prompted with correct code under test than when prompted with incorrect code under test. However, the size of this effect has never previously been measured, despite its obvious importance for both practicing software engineers and researchers. To answer the question, we conducted a comprehensive empirical study on 5 open-source and 6 closed-source language models, with 3 widely used benchmark data sets together with 41 repo-level real-world examples from two different real-world data sets. Our results reveal that, when compared to incorrect code under test, LLMs prompted with correct code achieve improvements in test accuracy, code coverage, and bug detection of 57%, 12%, and 24% respectively. We further show that these scientific conclusions carry over from the three benchmark data sets to the real-world code, where tests generated for incorrect code experience a 47% worse bug detection rate. Finally, we report that improvements of +18% in accuracy, +4% in coverage, and +34% in bug detection can be achieved by providing natural language code descriptions. These findings have actionable conclusions. For example, the 47% reduction in real-world bug detection is a clear concern. Fortunately, it is a concern for which our findings about the added value of descriptions offer an immediately actionable remedy.

Paper Structure (38 sections, 4 figures, 11 tables)

This paper contains 38 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: RQ1.1: Accuracy of LLM-generated test cases across HumanEval, MBPP, and APPS datasets using different prompts at test level and task level.
  • Figure 2: RQ1.2: Coverage of LLM-generated test cases.
  • Figure 3: Correlation between the code generation capability of LLMs and their ease of being misled during test generation in the MBPP dataset.
  • Figure 4: CodeBLEU scores of GPT-3.5-turbo generated test cases across five executions.