Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs

Ziyu Li, Donghwan Shin

TL;DR

This work introduces Mutation-based Consistency Testing (MCT) to evaluate LLMs' code understanding by injecting semantic mutations into code and assessing whether the model correctly detects inconsistencies with natural-language descriptions. A structured prompting protocol and a mutation-generation pipeline generate diverse inconsistent pairs from the HumanEval-X benchmark, enabling cross-language and cross-operator analysis. In a case study, GPT-4 substantially outperforms GPT-3.5 across mutation operators and languages, while one-shot prompting markedly enhances GPT-3.5’s performance, revealing practical guidance for deploying LLMs in code analysis tasks. The approach provides a replicable, fine-grained framework for probing code semantics and informs future LLM development and evaluation in software engineering contexts.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages, which have enabled various applications in software engineering, such as requirements engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess the code understanding performance of LLMs, especially for the subtle inconsistencies that may arise between code and its semantics described in natural language. In this paper, we propose a novel method to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions, by introducing code mutations to existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We then use these pairs to test the ability of LLMs to correctly detect the inconsistencies. We propose a new LLM testing method, called Mutation-based Consistency Testing (MCT), and conduct a case study on two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X, which consists of six programming languages (Python, C++, Java, Go, JavaScript, and Rust). We compare the performance of the LLMs across different types of code mutations and programming languages and analyze the results. We find that the LLMs show significant variation in their code understanding performance and that they have different strengths and weaknesses depending on the mutation type and language.
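To make the idea of a code mutation concrete, the following is a minimal Python sketch of a single operator-replacement mutation. It is not the authors' pipeline: the `AndToOr` class, the `mutate`, `load`, and `behaves_differently` helpers, and the `in_range` example function are all hypothetical names chosen for illustration, and checking the mutant against test inputs is only one plausible way to confirm that the mutation really changed the program's behavior.

```python
import ast


class AndToOr(ast.NodeTransformer):
    """Hypothetical mutation operator: turn one `and` into `or`."""

    def __init__(self):
        self.applied = False

    def visit_BoolOp(self, node):
        if not self.applied and isinstance(node.op, ast.And):
            node.op = ast.Or()
            self.applied = True  # apply only one mutation per mutant
        self.generic_visit(node)
        return node


def mutate(source: str) -> str:
    """Return the source of a mutant with one logical operator replaced."""
    tree = ast.parse(source)
    mutator = AndToOr()
    mutated = ast.fix_missing_locations(mutator.visit(tree))
    if not mutator.applied:
        raise ValueError("no `and` operator to mutate")
    return ast.unparse(mutated)  # ast.unparse requires Python 3.9+


def load(source: str, name: str):
    """Execute the source and return the function called `name`."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def behaves_differently(original_src, mutant_src, name, inputs):
    """True if the two versions disagree on at least one input."""
    original, mutant = load(original_src, name), load(mutant_src, name)
    return any(original(*args) != mutant(*args) for args in inputs)


# Illustrative code-description pair (not taken from HumanEval-X).
ORIGINAL = '''
def in_range(x, low, high):
    """Return True if x lies between low and high, inclusive."""
    return low <= x and x <= high
'''

MUTANT = mutate(ORIGINAL)
TEST_INPUTS = [(2, 1, 3), (5, 1, 3), (0, 1, 3)]

if __name__ == "__main__":
    print(MUTANT)
    print(behaves_differently(ORIGINAL, MUTANT, "in_range", TEST_INPUTS))
```

The mutant keeps the original docstring, so the description still promises an inclusive range check while the code now accepts values outside the range. This is the kind of inconsistent code-description pair that the LLM under test is asked to detect.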

Paper Structure (28 sections, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: An example problem, canonical solution, and test inputs in HumanEval (Chen et al., 2021)
  • Figure 2: A simplified mutant generation example. The logical operator 'and' in the original program (left) has been modified to 'or' in the mutant (right).
  • Figure 3: Zero-shot prompt template with {DESCRIPTION} and {CODE} as placeholders for corresponding artifacts
  • Figure 4: The additional input for one-shot prompts. This part is appended to the end of the zero-shot prompt template (Figure 3) to create one-shot prompts.
  • Figure 5: Model Decision Tree
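
The prompt construction described in the captions for Figures 3 and 4 can be sketched in Python as follows. The `{DESCRIPTION}` and `{CODE}` placeholders and the rule that the one-shot part is appended to the end of the zero-shot template come from the captions. Everything else is an assumption made for illustration: the actual prompt wording is not reproduced here, so the template text, the `EX_*` placeholders, and the `build_prompt` and `is_correct` helpers are hypothetical.

```python
# Hypothetical wording; only the {DESCRIPTION} and {CODE} placeholders
# are taken from the Figure 3 caption.
ZERO_SHOT_TEMPLATE = (
    "Does the code implement the description? Answer Yes or No.\n\n"
    "Description:\n{DESCRIPTION}\n\n"
    "Code:\n{CODE}\n"
)

# Hypothetical one-shot addition, appended to the end of the zero-shot
# template as the Figure 4 caption describes.
ONE_SHOT_SUFFIX = (
    "\nExample description:\n{EX_DESCRIPTION}\n\n"
    "Example code:\n{EX_CODE}\n\n"
    "Example answer: {EX_ANSWER}\n"
)


def build_prompt(description, code, example=None):
    """Fill the zero-shot template, or the one-shot variant if an
    (description, code, answer) example triple is given."""
    template = ZERO_SHOT_TEMPLATE
    fields = {"DESCRIPTION": description, "CODE": code}
    if example is not None:
        ex_description, ex_code, ex_answer = example
        template += ONE_SHOT_SUFFIX
        fields.update(
            EX_DESCRIPTION=ex_description,
            EX_CODE=ex_code,
            EX_ANSWER=ex_answer,
        )
    return template.format(**fields)


def is_correct(model_answer: str, is_mutant: bool) -> bool:
    """A mutant is inconsistent with its description, so the expected
    answer is 'No'; for unmutated code the expected answer is 'Yes'."""
    says_consistent = model_answer.strip().lower().startswith("yes")
    return says_consistent != is_mutant


if __name__ == "__main__":
    description = "Return True if x lies between low and high, inclusive."
    mutant_code = "def in_range(x, low, high):\n    return low <= x or x <= high"
    example = ("Return the sum of a and b.", "def add(a, b):\n    return a - b", "No")
    print(build_prompt(description, mutant_code))
    print(build_prompt(description, mutant_code, example))
```

Scoring a yes/no answer against whether the code is a mutant keeps the evaluation simple, but it assumes the model's reply starts with the verdict. A real harness would likely need more careful answer parsing.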