Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach

Zhenlan Ji; Pingchuan Ma; Zongjie Li; Shuai Wang

TL;DR

The paper tackles the variability of LLM-based code generation caused by natural language prompts. It introduces a causality-centric framework that builds a prompt-to-code causal graph from linguistic features and code metrics, enabling both explanation and optimization of prompts using causal inference based on the average treatment effect (ATE) and the DiBS causal discovery method. Across GPT-Neo, GPT-3.5-Turbo, and GPT-4 on the APPS Python dataset, the authors reveal model-dependent prompt effects and identify mediating linguistic features, demonstrating that prompt design can be systematically improved and even optimized via a downstream genetic algorithm. The results provide actionable guidance for prompt engineering and highlight a principled approach to calibrating prompts to maximize code quality in real-world LLM usage.

Abstract

While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where an LLM, regarded as a complex and powerful black-box model, is instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and their lack of transparency. Inspired by the recent progress in causality analysis and its application in software engineering, this paper launches a causality analysis-based approach to systematically analyze the causal relations between the LLM input prompts and the generated code. To handle various technical challenges in this study, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The resulting causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide by conducting studies on 3 popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness and to aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.

Paper Structure (24 sections, 1 equation, 8 figures, 5 tables, 1 algorithm)

Introduction
Preliminary and Motivation
LLM and Prompt Engineering
Code Generation
Causality Analysis
Research Motivation and Pilot Study
Effect of Prompt in Code Generation
Prompt Adjustment via LLM-Based Rephrasing
Analysis for Complex Relationships in Code Generation
Design
Prompt Quantification
Rephrase Generation
Causal Analysis
Establishing Prompt Effect on Code Generation
Experiment Setup
...and 9 more sections

Figures (8)

Figure 1: Motivating example of prompt engineering for code generation. Red boxes indicate the error in the generated code.
Figure 2: Illustration of the difference between correlation and causation.
Figure 3: Study overview.
Figure 4: Meta-prompt design. Red text indicates the programming question that is filled in by the user, and blue text indicates the pre-defined rephrasing intention that is selected by the user.
Figure 5: Two-step causal discovery.
...and 3 more figures

Theorems & Definitions (4)

Definition 1: Causal Graph
Definition 2: Endogenous and Exogenous Nodes
Definition 3: Global Markov Assumption
Definition 4: ATE
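
As a rough illustration of Definition 4, the average treatment effect of a binary treatment T on an outcome Y is commonly written as $ATE = E[Y \mid do(T=1)] - E[Y \mid do(T=0)]$. The sketch below estimates it by back-door adjustment over a single observed confounder. Everything in it is an assumption made for illustration and is not taken from the paper: the toy counts, the reading of T as "prompt was rephrased", Z as "problem is hard", and Y as "generated code passes its tests", as well as the helper names `make_rows`, `naive_difference`, and `backdoor_ate`.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def make_rows(z, t, n_pass, n_fail):
    """Build (z, t, y) records: n_pass rows with y=1 and n_fail rows with y=0."""
    return [(z, t, 1)] * n_pass + [(z, t, 0)] * n_fail


def naive_difference(rows):
    """Correlational estimate: E[Y | T=1] - E[Y | T=0], ignoring Z."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def backdoor_ate(rows):
    """Back-door adjustment: sum over z of P(z) * (E[Y|T=1,z] - E[Y|T=0,z]).

    Assumes Z is the only confounder and every stratum contains both
    treated and control rows (hypothetical toy setting).
    """
    strata = defaultdict(list)
    for z, t, y in rows:
        strata[z].append((t, y))
    total = len(rows)
    ate = 0.0
    for items in strata.values():
        treated = [y for t, y in items if t == 1]
        control = [y for t, y in items if t == 0]
        ate += (len(items) / total) * (mean(treated) - mean(control))
    return ate


# Hypothetical toy data: rephrasing (T=1) is applied mostly to hard problems (Z=1).
rows = (
    make_rows(z=0, t=1, n_pass=8, n_fail=2)      # easy, rephrased
    + make_rows(z=0, t=0, n_pass=18, n_fail=12)  # easy, original
    + make_rows(z=1, t=1, n_pass=12, n_fail=18)  # hard, rephrased
    + make_rows(z=1, t=0, n_pass=2, n_fail=8)    # hard, original
)

print("naive difference:", naive_difference(rows))
print("adjusted ATE:", backdoor_ate(rows))
```

On this toy data the naive difference is zero while the adjusted estimate is about 0.2, which is the kind of gap between correlation and causation that Figure 2 refers to. A real analysis would first need a causal graph to decide which variables to adjust for.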