Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension

Fangzhou Xu, Sai Zhang, Zhenchang Xing, Xiaowang Zhang, Yahong Han, Zhiyong Feng

TL;DR

Code quality evaluation is challenged by the limitations of match-based and execution-based metrics. HuCoSC presents a recursive semantic comprehension framework powered by a Semantic Dependency Decoupling Storage to enable independent semantic analysis of code fragments, followed by semantic comparison against a reference. Across CodeNet Code-Pair and HumanEval, HuCoSC demonstrates superior correlation with human judgments and with execution relative to baselines, and its recursive decomposition reduces LLM hallucinations on complex code. The work also introduces a Code-Pair validation dataset and shows HuCoSC’s adaptability to different backbones and evaluation settings, with practical implications for education and automatic code assessment.

Abstract

Code quality evaluation scores generated code against a reference code for a specific problem statement. There are currently two main forms of code quality evaluation: match-based and execution-based. Execution-based evaluation requires collecting a large number of test cases, which is costly. Match-based evaluation relies on superficial code matching as its metric, which fails to capture code semantics accurately; extensive research has demonstrated that match-based evaluations do not truly reflect code quality. With the development of large language models (LLMs) in recent years, studies have shown that LLMs can feasibly serve as evaluators for generative tasks. However, because of hallucinations and uncertainty in LLMs, their correlation with human judgment remains low, which makes the direct use of LLMs for code quality evaluation challenging. To address these issues, we propose Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension (HuCoSC). We employ a recursive approach in which the LLM comprehends one portion of the code semantics independently at each step, so that the semantics of the whole code are obtained through multiple interactions with the LLM. We design a Semantic Dependency Decoupling Storage to make this independent analysis feasible, allowing LLMs to derive more accurate semantics by breaking down complex problems. Finally, the generated code is scored based on a semantic comparison between it and the reference code. Experimental results indicate that HuCoSC surpasses existing state-of-the-art methods in terms of correlation with human experts and correlation with code execution.
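The recursive idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`comprehend`, `score`, `toy_llm`), the prompt wording, the use of Python's `ast` module to split code into sub-codes, and the plain dictionary standing in for the Semantic Dependency Decoupling Storage are all assumptions made for illustration. The final comparison uses token overlap only as a placeholder for the LLM-based semantic comparison the abstract describes.

```python
import ast
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: prompt in, description out.
LLM = Callable[[str], str]


def child_statements(node: ast.AST) -> List[ast.stmt]:
    """Statements nested directly inside `node` (body, else, finally).
    Simplification: bodies of `except` handlers are not visited."""
    children: List[ast.stmt] = []
    for field in ("body", "orelse", "finalbody"):
        value = getattr(node, field, [])
        if isinstance(value, list):
            children.extend(c for c in value if isinstance(c, ast.stmt))
    return children


def bound_names(node: ast.AST) -> List[str]:
    """Names a statement defines, which later fragments may depend on."""
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        return [node.name]
    if isinstance(node, ast.Assign):
        return [t.id for t in node.targets if isinstance(t, ast.Name)]
    return []


def comprehend(node: ast.AST, llm: LLM, store: Dict[str, str]) -> str:
    """Describe `node` by describing its sub-statements one at a time.
    `store` is a toy dictionary playing the role of the dependency storage."""
    children = child_statements(node)
    if children:
        # Recursive case: analyse each sub-code on its own, then combine.
        parts = [comprehend(child, llm, store) for child in children]
        is_stmt = isinstance(node, ast.stmt)
        header = ast.unparse(node).strip().splitlines()[0] if is_stmt else "program"
        prompt = f"Combine the steps inside '{header}':\n" + "\n".join(parts)
    else:
        # Base case: swap external dependencies for their stored descriptions,
        # so the fragment can be read without the rest of the program.
        used = sorted({n.id for n in ast.walk(node)
                       if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)})
        context = [f"{name} means: {store[name]}" for name in used if name in store]
        prompt = "\n".join(context + ["Describe: " + ast.unparse(node)])
    summary = llm(prompt)
    for name in bound_names(node):
        store[name] = summary  # update the store for later fragments
    return summary


def score(reference: str, generated: str, llm: LLM) -> float:
    """Placeholder comparison: Jaccard overlap of the two descriptions."""
    ref = set(comprehend(ast.parse(reference), llm, {}).split())
    gen = set(comprehend(ast.parse(generated), llm, {}).split())
    return len(ref & gen) / len(ref | gen) if ref | gen else 1.0


def toy_llm(prompt: str) -> str:
    """Deterministic fake 'LLM' for the demo: echoes the prompt's distinct tokens."""
    return " ".join(sorted(set(prompt.split())))


reference_code = "def f(x):\n    return x + 1\n"
generated_code = "def f(x):\n    return x * 2\n"
print(score(reference_code, generated_code, toy_llm))
```

In a real setting `toy_llm` would be replaced by a call to an actual model, and the overlap in `score` by a prompt that asks the model to compare the two semantic descriptions. The sketch requires Python 3.9 or later for `ast.unparse`.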

Paper Structure

This paper contains 47 sections, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: A comparison of scores from different code quality evaluation methods, with the scores of each method converted to a percentage scale.
  • Figure 2: A comparison of scores from different code quality evaluation methods, with the scores of each method converted to a percentage scale
  • Figure 3: Overall Framework of HuCoSC
  • Figure 4: Demonstrates how code is recursively decomposed into sub-codes.
  • Figure 5: Demonstrates how the Semantic Dependency Decoupling Storage eliminates external dependencies, as well as the process of updating its internal semantic descriptions.
  • ...and 4 more figures