Looking into Black Box Code Language Models
Muhammad Umair Haider, Umar Farooq, A. B. Siddique, Mark Marron
TL;DR
This work tackles the interpretability gap in code language models by probing the feed-forward layers, which house two-thirds of transformer parameters. Using Codegen-Mono and Polycoder across Python, Java, and Go, the authors reveal a layer-wise organization where lower FF layers capture syntax and higher layers encode abstract semantics, and demonstrate that targeted concept editing via masking can significantly affect the concept without harming overall performance. They quantify layer contributions to final outputs and show how context size influences which layers are responsible for predictions, highlighting a hierarchical information flow and a thinking-vs-output dynamic. The findings offer practical implications for debugging, updating APIs, and testing code LMs, advancing understanding beyond attention-focused analyses. Overall, the study provides a concrete, manipulable picture of how FF layers store, propagate, and integrate information in code-generation models.
Abstract
Language Models (LMs) have proven useful for code-related tasks, and several code LMs have been proposed recently. Most studies in this direction focus only on improving LM performance on different benchmarks, while treating the LMs themselves as black boxes. Beyond this, a handful of works attempt to understand the role of attention layers in code LMs. Nonetheless, feed-forward layers, which account for two-thirds of a typical transformer model's parameters, remain under-explored. In this work, we attempt to gain insights into the inner workings of code language models by examining the feed-forward layers. To conduct our investigations, we use two state-of-the-art code LMs, Codegen-Mono and Polycoder, and three widely used programming languages, Java, Go, and Python. We focus on examining the organization of stored concepts, the editability of these concepts, and the roles of different layers and input context size variations for output generation. Our empirical findings demonstrate that lower layers capture syntactic patterns while higher layers encode abstract concepts and semantics. We show that concepts of interest can be edited within feed-forward layers without compromising code LM performance. Additionally, we observe that initial layers serve as "thinking" layers, while later layers are crucial for predicting subsequent code tokens. Furthermore, we find that earlier layers suffice for accurate predictions when the context is small, whereas larger contexts require contributions from critical later layers. We anticipate that these findings will facilitate better understanding, debugging, and testing of code LMs.
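
To make the idea of editing concepts inside feed-forward layers more concrete, the following is a minimal sketch of masking hidden units in a single feed-forward block. It is an illustration under stated assumptions and not the procedure used in this work: the weights are random toy values, biases are omitted, and the names `feed_forward` and `masked` as well as the choice of masked indices are hypothetical.

```python
import math
import random


def gelu(x):
    # Tanh approximation of GELU, a common transformer activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def feed_forward(x, w_in, w_out, masked=frozenset()):
    """Toy feed-forward block: out = w_out @ gelu(w_in @ x).

    Hidden units whose index is in `masked` are zeroed before the
    output projection, which removes their contribution to the output.
    """
    hidden = [gelu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    hidden = [0.0 if i in masked else h for i, h in enumerate(hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


# Toy dimensions and random weights (assumptions for illustration only).
random.seed(0)
d_model, d_hidden = 4, 8
w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[random.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
x = [random.gauss(0, 1) for _ in range(d_model)]

full = feed_forward(x, w_in, w_out)                   # unedited output
edited = feed_forward(x, w_in, w_out, masked={2, 5})  # two hidden units masked
all_masked = feed_forward(x, w_in, w_out, masked=set(range(d_hidden)))
```

In this toy setting, masking a few hidden units changes the block's output while leaving the remaining units untouched, and masking every unit removes the block's contribution entirely. How the units tied to a given concept are identified in a real code LM is outside the scope of this sketch.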
