METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries

Hyeonseok Lee, Gabin An, Shin Yoo

TL;DR

This work tackles the problem of code documentation diverging from actual program behavior, which is costly to verify manually. METAMON combines EvoSuite-generated regression tests to capture current behavior with metamorphic prompting and self-consistency in a large language model to assess alignment with documentation, producing a normalized consistency score. In an empirical study on 9,482 documentation–code pairs from five Defects4J projects, METAMON achieves a precision of $0.72$ and recall of $0.48$, with ablation showing that metamorphic prompts and self-consistency improve results. The approach offers a scalable avenue for automated documentation verification and can complement fault localization and automatic repair in real-world software maintenance.

Abstract

Code documentation can, if written precisely, help developers better understand the code it accompanies. However, unlike code, documentation cannot be automatically verified via execution, which can lead to inconsistencies between the documentation and the actual program behavior. While such inconsistencies can harm a developer's understanding of the code, finding them remains costly because it requires human engineers. This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases, and subsequently uses LLM-based code reasoning to identify generated regression test oracles that are inconsistent with the program specifications in the documentation. METAMON is supported in this task by metamorphic testing and self-consistency. An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five open-source projects from Defects4J v2.0.1, shows that METAMON can classify code-and-documentation inconsistencies with a precision of 0.72 and a recall of 0.48.
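
To make the combination of metamorphic queries and self-consistency concrete, the following is a minimal sketch in Python and not the paper's implementation. It assumes a hypothetical `ask` callable that wraps an LLM and answers "yes" or "no" to the question of whether a test oracle agrees with the documentation, and it assumes each prompt variant carries a hypothetical `flipped` flag marking variants for which the expected answer is inverted (for example, a variant built around a negated assertion). The scoring formula is also an assumption, chosen only to illustrate a normalized score; this summary does not state the formula METAMON uses.

```python
from typing import Callable, Sequence, Tuple


def consistency_score(
    ask: Callable[[str], str],
    variants: Sequence[Tuple[str, bool]],
    samples_per_variant: int = 3,
) -> float:
    """Aggregate repeated LLM answers into a score in [-1.0, 1.0].

    +1.0 means every usable answer supports "oracle matches the documentation",
    -1.0 means every usable answer supports "oracle contradicts it".
    All names and the formula are illustrative assumptions.
    """
    consistent = 0
    inconsistent = 0
    for prompt, flipped in variants:
        # Self-consistency: sample the same prompt several times.
        for _ in range(samples_per_variant):
            answer = ask(prompt).strip().lower()
            if answer not in ("yes", "no"):
                continue  # ignore answers that cannot be parsed
            says_consistent = answer == "yes"
            # Metamorphic variant: an inverted question inverts the vote.
            if flipped:
                says_consistent = not says_consistent
            if says_consistent:
                consistent += 1
            else:
                inconsistent += 1
    total = consistent + inconsistent
    if total == 0:
        return 0.0  # no usable answers, so no evidence either way
    return (consistent - inconsistent) / total
```

With a stub in place of the model, a caller might threshold the result, flagging an oracle as suspicious when the score falls below some chosen cutoff; the cutoff itself would need to be tuned and is not specified here.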

Paper Structure

This paper contains 28 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An example of buggy source code along with its corresponding Javadoc and EvoSuite-generated regression test case
  • Figure 2: An example of a metamorphic prompt
  • Figure 3: Incorrect oracle detection of METAMON
  • Figure 4: Impact of metamorphic relations, self-consistency, and labels
  • Figure 5: An example of Lack of Specification Detail
  • ...and 2 more figures