Toward a Theory of Causation for Interpreting Neural Code Models
David N. Palacio, Alejandro Velasco, Nathan Cooper, Alvaro Rodriguez, Kevin Moran, Denys Poshyvanyk
TL;DR
The paper addresses the interpretability gap in Neural Code Models (NCMs) by proposing $do_{code}$, a causal post hoc framework that explains code predictions beyond what traditional accuracy metrics reveal. It grounds explanations in programming-language properties through a Structural Causal Model (SCM) and adopts Pearl's Ladder of Causation to define interventions and estimands such as the average treatment effect ($ATE$) and the interventional distribution $p(Y \mid do(T))$. The four-step pipeline (model the SCM, identify the estimand, estimate causal effects, validate the causal process) is demonstrated through seven case studies across RNNs, GRUs, GPT-2, and a BERT-like model, accompanied by the replication package icodegen. Key findings show that many observed correlations are confounded, but certain interventions (e.g., token masking) reveal meaningful causal effects, enabling bias detection and more trustworthy NCM-based developer tools.
Abstract
Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding their capabilities and limitations is becoming critical. However, the abilities of these models are typically measured using automated metrics that often reveal only a portion of their real-world performance. While, in general, the performance of NCMs appears promising, much is currently unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (e.g., brackets, parentheses, semicolons) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs.
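To make the notion of confounding bias concrete, the following is a minimal sketch on toy data, not the authors' implementation. It assumes a binary treatment $T$ (for example, whether a code intervention was applied), a binary outcome $Y$ (for example, whether the model predicted a token correctly), and a single binary confounder $Z$ (for example, a stand-in for a code property such as sequence length). All variable roles, counts, and function names are hypothetical. The sketch compares the naive association $E[Y \mid T=1] - E[Y \mid T=0]$ with a backdoor-adjusted estimate of $ATE = E[Y \mid do(T=1)] - E[Y \mid do(T=0)]$, computed as $\sum_z \big(E[Y \mid T=1, Z=z] - E[Y \mid T=0, Z=z]\big)\, p(Z=z)$.

```python
# Hypothetical toy data as (z, t, y, count) rows; none of these numbers come from the paper.
# By construction, Y depends only on Z, so the true causal effect of T on Y is zero,
# while Z also drives T, which induces a spurious T-Y correlation.
ROWS = [
    (0, 0, 1, 16), (0, 0, 0, 64),   # Z=0, T=0: 80 samples, 20% with Y=1
    (0, 1, 1, 4),  (0, 1, 0, 16),   # Z=0, T=1: 20 samples, 20% with Y=1
    (1, 0, 1, 16), (1, 0, 0, 4),    # Z=1, T=0: 20 samples, 80% with Y=1
    (1, 1, 1, 64), (1, 1, 0, 16),   # Z=1, T=1: 80 samples, 80% with Y=1
]


def mean_outcome(rows, t, z=None):
    """Empirical E[Y | T=t] or, if z is given, E[Y | T=t, Z=z]."""
    selected = [(y, c) for (zi, ti, y, c) in rows
                if ti == t and (z is None or zi == z)]
    total = sum(c for _, c in selected)
    return sum(y * c for y, c in selected) / total


def naive_difference(rows):
    """Associational contrast E[Y | T=1] - E[Y | T=0], ignoring the confounder."""
    return mean_outcome(rows, 1) - mean_outcome(rows, 0)


def backdoor_ate(rows):
    """Backdoor-adjusted ATE: stratify on Z, then weight each stratum by p(Z=z)."""
    n = sum(c for (_, _, _, c) in rows)
    ate = 0.0
    for z in sorted({zi for (zi, _, _, _) in rows}):
        p_z = sum(c for (zi, _, _, c) in rows if zi == z) / n
        ate += (mean_outcome(rows, 1, z) - mean_outcome(rows, 0, z)) * p_z
    return ate


if __name__ == "__main__":
    print("naive difference:", round(naive_difference(ROWS), 4))
    print("backdoor-adjusted ATE:", round(backdoor_ate(ROWS), 4))
```

On this toy data the naive contrast is about 0.36 while the adjusted estimate is 0, which illustrates how an apparent effect can be entirely due to a confounder. The adjustment is only valid under the assumption that $Z$ blocks every backdoor path between $T$ and $Y$; in practice that assumption has to be justified by the causal model.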
