CodeSCM: Causal Analysis for Multi-Modal Code Generation
Mukur Gupta, Noopur Bhatt, Suman Jana
TL;DR
CodeSCM introduces a Structural Causal Model to dissect how multi-modal prompts influence code generation by large language models. The framework embeds latent mediators for code and natural language semantics and applies interventions to individual prompt modalities. This lets it quantify the total effect (TE) and direct effect (DE) of the NL, Code$_{AL}$, Code$_{NL}$, and I/O components on code correctness $Y$. Empirically, input-output examples and the natural-language component of code (Code$_{NL}$) notably shape the generated code. Semantics-preserving perturbations can change accuracy and, in some cases, reduce hallucinations, and multi-modal pretraining further aligns modality embeddings in code-focused models. This causal analysis offers interpretable guidance for prompt design and illuminates how memory and pretraining shape multi-modal code generation, informing future improvements in code LLMs and evaluation protocols.
Abstract
In this paper, we propose CodeSCM, a Structural Causal Model (SCM) for analyzing multi-modal code generation using large language models (LLMs). By applying interventions to CodeSCM, we measure the causal effects of different prompt modalities, such as natural language, code, and input-output examples, on the model. CodeSCM introduces latent mediator variables to separate the code and natural language semantics of a multi-modal code generation prompt. Using the principles of Causal Mediation Analysis on these mediators, we quantify direct effects representing the model's spurious leanings. We find that, in addition to natural language instructions, input-output examples significantly influence code generation.
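To make the effect estimates concrete, the following is a minimal sketch of how an average effect on correctness $Y$ could be computed from per-problem pass/fail outcomes. It is not the authors' implementation. The function name `average_effect`, the sign convention (intervened minus original), the reading of TE as removing a modality and DE as a semantics-preserving rewrite of it, and all outcome values are assumptions made for illustration only.

```python
def average_effect(original, intervened):
    """Difference in mean correctness: intervened prompts minus original prompts.

    Both arguments are paired lists of 0/1 outcomes, one entry per problem.
    The sign convention is an assumption of this sketch.
    """
    if not original or len(original) != len(intervened):
        raise ValueError("expected two non-empty lists of equal length")
    return sum(intervened) / len(intervened) - sum(original) / len(original)


# Hypothetical pass/fail outcomes for eight problems (not results from the paper).
y_original = [1, 1, 1, 0, 1, 1, 0, 1]        # unmodified prompt
y_io_removed = [1, 0, 0, 0, 1, 0, 0, 1]      # I/O examples removed (assumed TE-style intervention)
y_io_rewritten = [1, 1, 0, 0, 1, 1, 0, 1]    # I/O examples rewritten, semantics kept (assumed DE-style)

total_effect = average_effect(y_original, y_io_removed)
direct_effect = average_effect(y_original, y_io_rewritten)

print(f"TE of I/O examples: {total_effect:+.3f}")
print(f"DE of I/O examples: {direct_effect:+.3f}")
```

Under this convention, a negative value means the intervention lowered the pass rate. A nonzero direct effect would indicate that surface form alone, with semantics held fixed, changes correctness.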
