Do Large Code Models Understand Programming Concepts? Counterfactual Analysis for Code Predicates

Ashish Hooda; Mihai Christodorescu; Miltiadis Allamanis; Aaron Wilson; Kassem Fawaz; Somesh Jha

TL;DR

This work addresses whether large code models truly understand core programming concepts by introducing Counterfactual Analysis for Programming Concept Predicates (CACP), a black-box framework that generates minimal, concept-specific mutations to test four PCPs: control flow, data flow, data types, and identifier naming. By applying CACP to code completion across benchmarks like HumanEval, MBPP, and CodeContests with ten popular models, the authors quantify understanding via the Average Mutation Effect (AME), finding substantial gaps (up to $34\%$ AME) especially for control-flow and data-flow predicates. The study further shows that model size and code-focused fine-tuning improve PCP understanding, though correlations across mutations are generally weak, underscoring the need for targeted data and training strategies. Overall, CACP provides a scalable, interpretable method to attribute failures to specific programming concepts and to guide robustness improvements in code-generation models.
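
As a rough illustration of how a score like AME could be computed, the sketch below assumes a per-problem mutation effect equal to the absolute difference between the pass/fail outcome on the original input and on its counterfactual, averaged over all problems. This definition is an assumption made for illustration, and the function name `average_mutation_effect` is hypothetical; the paper's exact formulation may differ.

```python
from statistics import mean
from typing import Sequence


def average_mutation_effect(original_pass: Sequence[bool],
                            mutated_pass: Sequence[bool]) -> float:
    """Average absolute change in correctness between paired inputs.

    ASSUMPTION: the per-problem effect is |pass(original) - pass(counterfactual)|.
    This is an illustrative definition, not necessarily the paper's.
    """
    if len(original_pass) != len(mutated_pass):
        raise ValueError("each original input needs exactly one counterfactual")
    if not original_pass:
        raise ValueError("at least one problem is required")
    # bool -> int so that True/False become 1/0 before taking the difference
    effects = [abs(int(o) - int(m)) for o, m in zip(original_pass, mutated_pass)]
    return float(mean(effects))


# Hypothetical outcomes for four problems: two flip from pass to fail.
ame = average_mutation_effect([True, True, False, True],
                              [True, False, False, False])
print(f"AME = {ame:.0%}")
```

Under this assumed definition, a model whose correctness never changes under a mutation scores 0, and larger values indicate that the mutation changes the model's behavior more often.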

Abstract

The success of Large Language Models at text generation has also made them better at code generation and other coding tasks. While much work has demonstrated their remarkable performance on tasks such as code completion and editing, it remains unclear why they perform so well. We help bridge this gap by exploring to what degree auto-regressive models understand the logical constructs of the underlying programs. We propose Counterfactual Analysis for Programming Concept Predicates (CACP), a counterfactual testing framework for evaluating whether Large Code Models understand programming concepts. With only black-box access to the model, we use CACP to evaluate ten popular Large Code Models on four different programming concepts. Our findings suggest that current models lack understanding of concepts such as data flow and control flow.

Paper Structure (19 sections, 3 equations, 4 figures, 5 tables)

Introduction
Background and Related Work
Programming Concept Predicates and LLMs for Code.
Robustness of Code Models.
Counterfactual Analysis.
Counterfactual Analysis for Programming Concept Predicates
Notation
Requirements
Mutations for Counterfactual Programs
Measuring Counterfactual Effect
CACP for Code Completion
Large Language Models for Code Completion
Counterfactual Generation
Effect Measurement
Experiments
...and 4 more sections

Figures (4)

Figure 1: In this example, the counterfactual input is generated by negating the relational expression in the statement. StarCoder (Li et al., 2023) generates an incorrect completion for the input on the right. This suggests that LLMs have an incomplete understanding of programming concepts such as control flow.
Figure 2: The counterfactual generation pipeline of CACP. It consists of two stages. First, the reference solution for the problem is perturbed using predicate-specific mutations. Second, both the original and the perturbed solution are cut at the same location to generate a pair of counterfactual inputs.
Figure 3: $\mathsf{AME}$ as a function of model size (number of parameters, in billions). The different model classes are depicted using different colors.
Figure 4: Correlation between $\mathsf{AME}$ values across pairs of mutations. The number of samples used to compute each value depends on the size of the intersection of the two mutation types. Abbreviations: SWAP = Independent-Swap, IFFP = IfElse-Flip, RAND = Variable Names Random, SHUF = Variable Names Shuffle.
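
The two-stage pipeline described in the Figure 2 caption can be sketched in a few lines of Python. The example below is a minimal illustration, not the authors' implementation: it assumes an IfElse-Flip style mutation that negates each `if` condition and swaps its branches, and it assumes that "cutting at the same location" means keeping the same number of leading lines from both programs. The names `IfElseFlip`, `flip_if_else`, and `counterfactual_pair` are hypothetical.

```python
import ast


class IfElseFlip(ast.NodeTransformer):
    """Negate each if-condition and swap its two branches.

    ASSUMPTION: this is one plausible reading of an if/else flip mutation;
    statements without an else branch are left unchanged.
    """

    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)  # mutate nested if-statements first
        if not node.orelse:
            return node
        negated = ast.UnaryOp(op=ast.Not(), operand=node.test)
        flipped = ast.If(test=negated, body=node.orelse, orelse=node.body)
        return ast.copy_location(flipped, node)


def flip_if_else(source: str) -> str:
    """Stage 1: perturb the reference solution with the mutation."""
    tree = IfElseFlip().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))  # Python 3.9+


def counterfactual_pair(reference: str, cut_line: int) -> tuple[str, str]:
    """Stage 2: cut both programs after the same number of lines."""
    original = ast.unparse(ast.parse(reference))  # normalize formatting
    mutated = flip_if_else(reference)

    def prefix(src: str) -> str:
        return "\n".join(src.splitlines()[:cut_line])

    return prefix(original), prefix(mutated)


REFERENCE = """
def sign(x):
    if x > 0:
        return 1
    else:
        return -1
"""

original_prompt, counterfactual_prompt = counterfactual_pair(REFERENCE, cut_line=2)
print(original_prompt)
print("-----")
print(counterfactual_prompt)
```

Each prompt would then be given to the model for completion, and the two completions checked against the problem's tests. Because the flipped program preserves the behavior of the reference solution, a model that handles control flow correctly should complete both prompts correctly.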
