Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Marcus J. Min; Yangruibo Ding; Luca Buratti; Saurabh Pujar; Gail Kaiser; Suman Jana; Baishakhi Ray

TL;DR

IdentityChain is a formal framework for evaluating the self-consistency of Code LLMs: it chains NL-to-PL and PL-to-NL generations and checks whether semantics are preserved across steps. It defines a semantics-based notion of self-consistency, proposes the Test Output Match (TOM) score for effective PL-to-NL evaluation, and uses greedy decoding for efficient inference. Experiments across eleven Code LLMs show that self-consistency degrades as chains grow longer and is not strictly correlated with conventional accuracy, which makes it a distinct evaluation dimension. IdentityChain also serves as a debugging tool, uncovering weaknesses in handling data types, implicit semantics, and execution prediction, while the TOM score supports holistic assessment of NL-to-PL and PL-to-NL tasks.

Abstract

Code Large Language Models (Code LLMs) are increasingly employed in real-life applications, so evaluating them is critical. While conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool by demonstrating three major weaknesses that it exposes in current models. Our code is available at https://github.com/marcusm117/IdentityChain.

Paper Structure (21 sections, 7 equations, 8 figures, 2 tables)

Introduction
Related Work
Formalization
Self-Consistency Definition
Self-Consistency Evaluation
The IdentityChain Framework
Effective Self-Consistency Evaluation
Efficient Self-Consistency Evaluation
Holistic Evaluation of Code LLMs
Experiments
Results
Self-Consistency of Code LLMs
Effectiveness of TOM score
Efficiency of Greedy Decoding
IdentityChain As a Model Debugging Tool
...and 6 more sections

Figures (8)

Figure 1: The IdentityChain Framework. Starting from a docstring $nl_0$, instruct the model to generate a program $pl_0$, summarize $pl_0$ into a new docstring $nl_1$, and generate a new program $pl_1$. If the test outputs of $pl_1$ do not match the ones of $pl_0$, then the model is not self-consistent. This chain can be extended to length $n \in \mathbb{N}$ and we compute whether, for all $i<n$, the test outputs of $pl_{i}$ match the ones of $pl_{i+1}$, returning a binary result that indicates if the model is self-consistent regarding $nl_0$.
Figure 2: SSC$_i$ and SC$_i$ Computed at Each Step $i$.
Figure 3: SC$_5$ Evaluated at Different Temperatures.
Figure 4: Replacing Meaningful Function Names with A Generic "func". Given the docstring with the original function name, GPT-3.5 generates an incorrect program that conflicts with the function name. When further summarizing that program along with the original function name, GPT-3.5 completely ignores the code and generates a new docstring based on the function name. In this case, we will falsely conclude that GPT-3.5 is not self-consistent. However, when summarizing the program along with a generic name "func" in replacement, GPT-3.5 correctly captures the code semantics and thus is self-consistent w.r.t. the original docstring. Therefore, when generating $nl_i$ and $pl_i$ for $i \geq 1$, we replace the original meaningful function name with the generic "func".
Figure 5: SSC$_5$ Evaluated at Different Temperatures. Similar to the SC$_5$ results in the "Efficiency of Greedy Decoding" section, for the strong self-consistency score SSC$_5$, the relative rankings of models mostly hold regardless of temperature, i.e., models that are more strongly self-consistent remain so at every temperature, which confirms that the greedy results generalize to different temperatures.
...and 3 more figures
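
The chain described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `collect_outputs`, `is_self_consistent`, `toy_nl_to_pl`, `faithful_pl_to_nl`, and `lossy_pl_to_nl` are hypothetical, programs are represented as plain Python callables rather than generated source code, and the toy "models" are hard-coded stand-ins for a Code LLM.

```python
from typing import Any, Callable, List, Sequence, Tuple

Program = Callable[[Any], Any]


def collect_outputs(program: Program, test_inputs: Sequence[Any]) -> List[Tuple[str, Any]]:
    """Run a program on every test input, recording results or exception types."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(("ok", program(x)))
        except Exception as exc:  # a crash is also an observable behavior
            outputs.append(("error", type(exc).__name__))
    return outputs


def is_self_consistent(
    nl_0: str,
    nl_to_pl: Callable[[str], Program],
    pl_to_nl: Callable[[Program], str],
    test_inputs: Sequence[Any],
    n: int = 1,
) -> bool:
    """Return True if the test outputs of pl_i match those of pl_{i+1} for all i < n."""
    pl_prev = nl_to_pl(nl_0)            # pl_0 generated from the original docstring
    for _ in range(n):
        nl_next = pl_to_nl(pl_prev)     # summarize pl_i into nl_{i+1}
        pl_next = nl_to_pl(nl_next)     # generate pl_{i+1} from nl_{i+1}
        if collect_outputs(pl_prev, test_inputs) != collect_outputs(pl_next, test_inputs):
            return False
        pl_prev = pl_next
    return True


# Hypothetical stand-ins for a model (assumptions for illustration only).
def toy_nl_to_pl(nl: str) -> Program:
    if "double" in nl:
        return lambda x: 2 * x
    return lambda x: x + 1


def faithful_pl_to_nl(pl: Program) -> str:
    # Describes the program's actual behavior.
    return "double the input" if pl(3) == 6 else "increment the input"


def lossy_pl_to_nl(pl: Program) -> str:
    # Ignores the program and always produces the same docstring.
    return "increment the input"


inputs = [0, 1, 2]
consistent = is_self_consistent("double the input", toy_nl_to_pl, faithful_pl_to_nl, inputs, n=3)
inconsistent = is_self_consistent("double the input", toy_nl_to_pl, lossy_pl_to_nl, inputs, n=3)
print(consistent, inconsistent)
```

The comparison is on observable test outputs rather than on program text, so two syntactically different programs still count as matching when they behave identically on the test inputs.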
