CodeSCM: Causal Analysis for Multi-Modal Code Generation

Mukur Gupta, Noopur Bhatt, Suman Jana

TL;DR

CodeSCM introduces a Structural Causal Model to dissect how multi-modal prompts influence code generation by large language models. By introducing latent mediators for code and natural language semantics and applying interventions on prompt modalities, the framework quantifies the total and direct effects (TE and DE) of the NL, Code$_{AL}$, Code$_{NL}$, and I/O components on code correctness $Y$. The empirical results show that input-output examples and natural-language code components notably shape generated code, and that semantics-preserving perturbations can change accuracy and, in some cases, reduce hallucinations; multi-modal pretraining further aligns modality embeddings in code-focused models. This causal analysis offers interpretable guidance for prompt design and illuminates how memorization and pretraining shape multi-modal code generation, informing future improvements in code LLMs and evaluation protocols.

Abstract

In this paper, we propose CodeSCM, a Structural Causal Model (SCM) for analyzing multi-modal code generation using large language models (LLMs). By applying interventions to CodeSCM, we measure the causal effects of different prompt modalities, such as natural language, code, and input-output examples, on the model. CodeSCM introduces latent mediator variables to separate the code and natural language semantics of a multi-modal code generation prompt. Using the principles of Causal Mediation Analysis on these mediators, we quantify direct effects representing the model's spurious leanings. We find that, in addition to natural language instructions, input-output examples significantly influence code generation.
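To make the idea of measuring a modality's causal effect concrete, the following is a minimal sketch and not the paper's estimator. It assumes the total effect of a modality can be approximated as the change in mean correctness $Y$ when that modality is perturbed across a set of paired prompts. The function name `total_effect`, the sign convention (intervened minus original), and the pass/fail outcomes are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def total_effect(y_original, y_intervened):
    """Change in mean correctness after intervening on one prompt modality.

    Assumption (not from the paper): the effect is estimated as
    mean(Y | intervened prompt) - mean(Y | original prompt) over paired prompts.
    """
    if len(y_original) != len(y_intervened):
        raise ValueError("paired outcomes must have equal length")
    return mean(y_intervened) - mean(y_original)


# Hypothetical outcomes: 1 means the generated code passed all tests, 0 otherwise.
y_orig = [1, 1, 0, 1, 1, 0, 1, 1]          # original prompts
y_io_perturbed = [1, 0, 0, 1, 0, 0, 1, 1]  # same prompts with perturbed I/O examples

te_io = total_effect(y_orig, y_io_perturbed)
print(f"Estimated effect of the I/O intervention: {te_io:+.2f}")
```

Under this convention, a negative value means that the intervention lowered accuracy on the sampled prompts. Separating the direct effect from the mediated effect would additionally require holding the latent semantic mediators fixed, which this sketch does not attempt.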

Paper Structure

This paper contains 45 sections, 15 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: CodeSCM causal graph representing the total and direct effects of the modal variable nodes on the response variable $Y$, which represents the correctness of the generated code. $Code_{AL}$ represents the algorithmic channel of code and $Code_{NL}$ the natural language channel of code.
  • Figure 2: (a) Modalities in an example from the mMBPP+ dataset: red for NL, blue for $Code_{AL}$ and $Code_{NL}$, and green for input/output examples. (b) Semantics-preserving transformations: red for natural language, blue for $Code_{NL}$, orange for $Code_{AL}$, and green for I/O examples.
  • Figure 3: The original HumanEval+ prompt (top) includes the function header intersperse, followed by natural language instructions and input-output pairs for code generation. The first modification (middle) removes the algorithmic code channel by eliminating all code components while retaining a natural language description of the function header in the docstring. The second modification (bottom) removes the natural language channel by standardizing the function name.
  • Figure 4: (a) and (b) PCA projections of the embeddings of the modalities in the input prompt by CodeLLaMa and LLaMa-2. (c) and (d) Prompt embedding projections along with the ground-truth code embedding projections by CodeLLaMa and LLaMa-2. $Code_{AL}$ and $Code_{NL}$ are combined into function_signature.
  • Figure 5: The left figure shows a CoderEval-SCJ prompt where dead code insertion corrects the original prompt's error of creating a hallucinated Java class (red box). The top right figure illustrates an mMBPP+ prompt where I/O pair transformations lead to a semantic error in lines 15-16. The bottom right figure shows GPT-4T’s memorization of a HumanEval+ prompt.

Theorems & Definitions (2)

  • Definition 2.1
  • Definition 2.2