Identifying Inaccurate Descriptions in LLM-generated Code Comments via Test Execution

Sungmin Kang, Louis Milliken, Shin Yoo

TL;DR

This work addresses factual inaccuracies in LLM-generated code comments by introducing document testing: an LLM generates tests from a comment, the tests are executed, and their outcomes are used to assess whether the comment is accurate. The authors manually labeled 540 Java method comments generated by three LLMs and found substantial inaccuracy; even for GPT-4, roughly a fifth of the comments contained an inaccurate statement. Existing code-comment consistency and similarity-based baselines have little predictive power for factual correctness, whereas document testing shows a strong statistical relationship with comment accuracy, offering a practical, interpretable signal that can improve trust in automated documentation. The approach is validated on Java methods from Defects4J, using a two-stage prompting and test-execution pipeline that improves effectiveness. The authors also discuss limitations and future work, including generalizing beyond Java and improving the reliability of the test environment.

Abstract

Software comments are critical for human understanding of software, and as such many comment generation techniques have been proposed. However, we find that a systematic evaluation of the factual accuracy of generated comments is rare; only subjective accuracy labels have been given. Evaluating comments generated by three Large Language Models (LLMs), we find that even for the best-performing LLM, roughly a fifth of its comments contained demonstrably inaccurate statements. While it seems code-comment consistency detection techniques should be able to detect inaccurate comments, we perform experiments demonstrating they have no statistically significant relationship with comment accuracy, underscoring the substantial difficulty of this problem. To tackle this, we propose the concept of document testing, in which a document is verified by using an LLM to generate tests based on the document, running those tests, and observing whether they pass or fail. Furthermore, we implement our concept to verify Java comments. Experiments demonstrate that our approach has a robust statistical relationship with comment accuracy, making headway into a problem where prior techniques failed. Qualitative evaluation also reveals the promise of our approach in gaining developer trust, while highlighting the limitations of our current implementation.
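
The generate-run-observe loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper targets Java, whereas Python is used here only for brevity, and the LLM generation step is replaced by two hand-written stand-in checks. The function `clamp`, its comment, the check functions, and the use of a plain pass rate as the signal are all hypothetical choices made for illustration.

```python
from typing import Callable, List


def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical code under documentation."""
    return max(low, min(value, high))


# Hypothetical generated comment to be verified:
#   "Returns value limited to [low, high]; raises ValueError if low > high."
# The second claim is inaccurate: clamp never raises.

# Stand-ins for tests an LLM might derive from the comment (assumption:
# one check per claim in the comment).
def check_limits_to_range() -> None:
    assert clamp(15, 0, 10) == 10
    assert clamp(-3, 0, 10) == 0


def check_raises_when_bounds_inverted() -> None:
    try:
        clamp(5, 10, 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for inverted bounds")


def pass_rate(checks: List[Callable[[], None]]) -> float:
    """Run each check and return the fraction that pass."""
    passed = 0
    for check in checks:
        try:
            check()
            passed += 1
        except Exception:
            # A failing check is evidence that some claim in the
            # comment does not match the code's behavior.
            pass
    return passed / len(checks)


rate = pass_rate([check_limits_to_range, check_raises_when_bounds_inverted])
print(f"pass rate: {rate}")
```

In this sketch the check derived from the accurate claim passes and the one derived from the inaccurate claim fails, so the failure points at the specific statement that does not hold. A real pipeline would also have to handle tests that fail for unrelated reasons, such as compilation or environment errors.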

Paper Structure (27 sections, 7 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Comment factual accuracy by generating LLM.
  • Figure 2: Diagram of error taxonomy for GPT-3 comments.
  • Figure 3: Diagram of document testing pipeline.
  • Figure 4: Relationship between comment accuracy and suggested indicators.
  • Figure 5: ROC-AUC and AP values compared with baselines. For our approach (blue), we present the mean value from five runs, along with its 95% confidence interval.
  • ...and 6 more figures