Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks?

Xiao Yu, Lei Liu, Xing Hu, Jacky Wai Keung, Jin Liu, Xin Xia

TL;DR

This work systematically evaluates ChatGPT's self-verification capability across code generation, code completion, and program repair using three verification prompts: a direct question, a guiding question, and a test report. It finds that ChatGPT often misclassifies its own incorrect outputs as correct and exhibits self-contradictory hallucinations. Guiding questions can boost recall, and test reports increase vulnerability detection but often yield inaccurate explanations. The study shows that prompt design can meaningfully affect self-verification performance, but human judgment remains essential for reliable software development. These findings motivate cautious integration of LLMs into developer workflows and point to directions in prompt engineering and evaluation for more trustworthy AI-assisted programming. The results are relevant to researchers and practitioners who want to use LLMs while mitigating their current limitations.

Abstract

With the increasing utilization of large language models such as ChatGPT during software development, it has become crucial to verify the quality of the code they generate. Recent studies proposed utilizing ChatGPT as both a developer and tester for multi-agent collaborative software development. The multi-agent collaboration empowers ChatGPT to produce test reports for its generated code, enabling it to self-verify the code and fix bugs based on these reports. However, these studies did not assess the effectiveness of the generated test reports in validating the code. Therefore, we conduct a comprehensive empirical investigation to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair. We request ChatGPT to (1) generate correct code and then self-verify its correctness; (2) complete code without vulnerabilities and then self-verify for the presence of vulnerabilities; and (3) repair buggy code and then self-verify whether the bugs are resolved. Our findings on two code generation datasets, one code completion dataset, and two program repair datasets reveal the following observations: (1) ChatGPT often erroneously predicts its incorrectly generated code as correct. (2) ChatGPT's behavior exhibits self-contradictory hallucinations. (3) The self-verification capability of ChatGPT can be enhanced by asking a guiding question, which queries whether ChatGPT agrees with assertions about incorrectly generated or repaired code and vulnerabilities in completed code. (4) Using test reports generated by ChatGPT can identify more vulnerabilities in completed code, but the explanations for incorrectly generated code and failed repairs are mostly inaccurate in the test reports. Based on these findings, we provide implications for further research or development using ChatGPT.
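
The kind of measurement behind observations (1) and (3) can be illustrated with a minimal sketch. It assumes that "incorrect code" is the positive class, so that incorrect code predicted as correct counts as a false negative and lowers recall. The `Verdict` type, the function name, and the sample data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    # Ground truth, e.g. obtained by running reference tests (assumption).
    truly_incorrect: bool
    # The model's own self-verification answer for the same code.
    predicted_incorrect: bool


def self_verification_metrics(verdicts):
    """Score self-verification, treating 'incorrect code' as the positive class."""
    tp = sum(v.truly_incorrect and v.predicted_incorrect for v in verdicts)
    fp = sum((not v.truly_incorrect) and v.predicted_incorrect for v in verdicts)
    fn = sum(v.truly_incorrect and (not v.predicted_incorrect) for v in verdicts)
    tn = sum((not v.truly_incorrect) and (not v.predicted_incorrect) for v in verdicts)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up sample: three truly incorrect programs, only one of them flagged.
sample = [
    Verdict(truly_incorrect=True, predicted_incorrect=True),
    Verdict(truly_incorrect=True, predicted_incorrect=False),
    Verdict(truly_incorrect=True, predicted_incorrect=False),
    Verdict(truly_incorrect=False, predicted_incorrect=False),
    Verdict(truly_incorrect=False, predicted_incorrect=True),
]
metrics = self_verification_metrics(sample)
```

Under this convention, a model that tends to accept its own incorrect code shows low recall, and a prompt that makes it flag more of that code raises recall, possibly at the cost of precision.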

Paper Structure

This paper contains 19 sections, 11 figures, 9 tables.

Figures (11)

  • Figure 1: The designed self-verification prompts for code generation.
  • Figure 2: The designed self-verification prompts for code completion.
  • Figure 3: The designed self-verification prompts for program repair.
  • Figure 4: Correctly generated code that is predicted correct with the direct question but incorrect with the guiding question and the test report (i.e., Example 2).
  • Figure 5: Incorrectly generated code that is predicted incorrect (i.e., Example 3).
  • ...and 6 more figures