Mechanistic Understanding of Language Models in Syntactic Code Completion

Samuel Miller, Daking Rai, Ziyu Yao

TL;DR

The paper addresses the opacity of internal decision processes in Code LMs during syntax-critical tasks by applying mechanistic interpretability techniques to CodeLlama-7b. It introduces a synthetic closing-parentheses dataset and uses logit lens, logit difference, and attention visualization to dissect layer-level and head-level contributions to token prediction. The key findings show that the correct token prediction emerges in the middle-to-late layers, that multi-head attention is generally more impactful than the feed-forward sub-layers, and that a small set of attention heads track the parenthesis count, including one head that can mislead predictions. These insights have practical implications for improving code-generation reliability and safety by guiding circuit-discovery efforts and targeted interventions on specific attention heads.
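The logit-difference measurement named above can be illustrated with a minimal sketch. It is not the authors' code: the three-dimensional "model", the unembedding vectors, the token ids, and the per-layer residual vectors are all made-up assumptions. The final layer norm that a real logit lens would usually apply before unembedding is omitted for brevity.

```python
# Toy logit-lens logit difference. Every number below is hypothetical.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def logit_difference(resid, unembed, correct_id, counterfactual_id):
    """Return logit(correct) - logit(counterfactual) for one residual vector.

    resid   -- residual-stream activation (list of d_model floats)
    unembed -- dict mapping token id -> unembedding vector (assumed layout)
    """
    return dot(resid, unembed[correct_id]) - dot(resid, unembed[counterfactual_id])

def logit_lens_curve(resid_by_layer, unembed, correct_id, counterfactual_id):
    """Project each layer's residual stream to logits and take the difference."""
    return [
        logit_difference(r, unembed, correct_id, counterfactual_id)
        for r in resid_by_layer
    ]

# Hypothetical vocabulary: id 0 = correct closing token, id 1 = counterfactual.
UNEMBED = {0: [1.0, 0.0, 0.5], 1: [0.0, 1.0, 0.5]}

# Hypothetical residual-stream activations after four successive layers.
RESID_BY_LAYER = [
    [0.0, 0.0, 1.0],
    [0.5, 1.0, 1.0],
    [2.0, 1.0, 1.0],
    [4.0, 0.5, 1.0],
]

curve = logit_lens_curve(RESID_BY_LAYER, UNEMBED, correct_id=0, counterfactual_id=1)
print(curve)  # [0.0, -0.5, 1.0, 3.5]
```

In this toy example the difference only turns clearly positive in the later layers, which is the shape of curve the summary describes; the actual values for CodeLlama-7b are those reported in the paper's figures, not these.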

Abstract

Recently, language models (LMs) have shown impressive proficiency in code generation tasks, especially when fine-tuned on code-specific datasets, commonly known as Code LMs. However, our understanding of the internal decision-making processes of Code LMs, such as how they use their (syntactic or semantic) knowledge, remains limited, which could lead to unintended harm as they are increasingly used in real life. This motivates us to conduct one of the first Mechanistic Interpretability studies of how Code LMs perform a syntactic completion task, specifically the closing parenthesis task, on the CodeLlama-7b model (Roziere et al. 2023). Our findings reveal that the model cannot confidently predict the correct label for the closing parenthesis task until the middle-to-late layers. Additionally, we identify that while both multi-head attention (MHA) and feed-forward (FF) sub-layers play essential roles, MHA is particularly crucial. Furthermore, we discover attention heads that precisely keep track of the number of already closed parentheses but may or may not promote the correct number of closing parentheses that are still missing, leading to a positive or negative impact on the model's performance.

Paper Structure

This paper contains 20 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Logit difference of the Code LM between the correct and the counterfactual tokens across layers of the residual stream. "$L$_pre" and "$L$_post" indicate residual-stream activations before and after layer $L$, respectively.
  • Figure 2: Sub-layer contributions to the residual-stream logit difference of the Code LM between the correct and counterfactual tokens. Figures of the sub-layer logit difference for other class constructors are shown in the Appendix.
  • Figure 3: Logit differences between the correct and counterfactual tokens of various attention layers and heads for each sub-task. We observed that the contribution to the logit difference was dominantly made by a few heads (e.g., $L30H0$ and $L27H24$ for the Two Closing Parenthesis task).
  • Figure 4: Sub-layer contributions to the residual-stream logit difference of the Code LM between the correct and counterfactual tokens. "embed" indicates the word embedding.
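
The head-level analysis summarized in the Figure 3 caption can be sketched as a direct attribution: each head's output vector is projected onto the difference between the unembedding vectors of the correct and counterfactual tokens. This is a minimal illustration under stated assumptions, not the paper's implementation. The head outputs, vector sizes, and head indices are hypothetical, and layer normalization is ignored so that the per-head contributions add up exactly.

```python
# Toy per-head logit-difference attribution. All values are hypothetical.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def head_logit_diffs(head_outputs, direction):
    """Score each head by its output projected onto the logit-diff direction.

    head_outputs -- dict mapping (layer, head) -> vector the head writes to
                    the residual stream (assumed layout)
    direction    -- unembed(correct) - unembed(counterfactual)
    """
    return {
        f"L{layer}H{head}": dot(vec, direction)
        for (layer, head), vec in head_outputs.items()
    }

def top_heads(scores, k):
    """Return the k heads with the largest absolute contribution."""
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Hypothetical unembedding vectors for the two candidate tokens.
UNEMBED_CORRECT = [1.0, 0.0, 0.5]
UNEMBED_COUNTERFACTUAL = [0.0, 1.0, 0.5]
DIRECTION = [a - b for a, b in zip(UNEMBED_CORRECT, UNEMBED_COUNTERFACTUAL)]

# Hypothetical outputs of four heads in a two-layer, two-head toy model.
HEAD_OUTPUTS = {
    (0, 0): [0.25, 0.25, 1.0],  # neutral: no effect on the difference
    (0, 1): [1.5, 0.25, 0.0],   # promotes the correct token
    (1, 0): [2.0, 0.0, 0.1],    # strongly promotes the correct token
    (1, 1): [0.0, 1.0, 0.0],    # promotes the counterfactual token
}

scores = head_logit_diffs(HEAD_OUTPUTS, DIRECTION)
print(top_heads(scores, 2))  # [('L1H0', 2.0), ('L0H1', 1.25)]
print(scores["L1H1"])        # -1.0
```

A head with a negative score, such as the toy `L1H1` here, pushes the prediction toward the counterfactual token, which is the kind of misleading head the summary mentions.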