A Critical Study of What Code-LLMs (Do Not) Learn

Abhinav Anand; Shweta Verma; Krishna Narasimhan; Mira Mezini

TL;DR

The paper critically examines how code-LLMs encode code properties, arguing that attention maps and hidden representations largely miss the relations between syntactic tokens and identifiers, which are essential for program flow. Using AST and data-flow graphs, along with DirectProbe, it shows that fine-tuned models encode these relations more poorly than their pre-trained counterparts, and that models with billions of parameters encode less information about code than models with a few hundred million parameters, which suggests memorization and shortcuts rather than genuine code understanding. The study also finds that hidden representations encode information non-linearly, and it warns that prior interpretability methods relying on fixed thresholds or linear probes can lead to misleading conclusions. These findings motivate new training objectives and architectures, beyond mere scaling, that robustly capture code structure and semantics. The work also emphasizes careful experimental design for interpretability studies of code-LLMs (cLLMs) and raises natural-language/programming-language (NL-PL) alignment as a consideration for future research.

Abstract

Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and larger training datasets, code-LLMs still have limitations, such as suggesting code with syntactic errors or variable misuse. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. Yet previous work has not studied which code properties code-LLMs fail to encode. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.
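To make the token categorization concrete, the following is a minimal sketch that splits the tokens of a Python snippet into identifiers and syntactic tokens using the standard-library tokenizer. It is not the authors' pipeline: the function name `categorize_tokens` is hypothetical, and treating keywords and operators/delimiters as "syntactic tokens" while ignoring literals, comments, and layout tokens is an assumption made for illustration.

```python
import io
import keyword
import tokenize


def categorize_tokens(source):
    """Split Python source tokens into (identifiers, syntactic tokens).

    Assumption for this sketch: keywords and operators/delimiters count as
    syntactic tokens; every other NAME token counts as an identifier.
    """
    identifiers, syntactic = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            identifiers.append(tok.string)
        elif tok.type == tokenize.NAME or tok.type == tokenize.OP:
            syntactic.append(tok.string)
        # Literals, comments, and layout tokens are ignored in this sketch.
    return identifiers, syntactic


ids, syn = categorize_tokens("def add(a, b):\n    return a + b\n")
print(ids)  # identifiers such as the function name and its parameters
print(syn)  # keywords, brackets, commas, and operators
```

In this toy snippet, a relation such as the one between `return` and the identifiers `a` and `b` is of the syntactic-identifier kind that the abstract reports as poorly encoded.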

Paper Structure (26 sections, 2 equations, 11 figures, 10 tables)

Introduction
Related Work
Experiments
Models and Dataset
Attention Analysis
Setup
Analysis
Analysis of Hidden Representations
Qualitative Analysis with t-SNE
Probing on Hidden Representations
Results
Attention Analysis
Analysis of Hidden Representation
Discussion
Limitations of cLLMs
...and 11 more sections

Figures (11)

Figure 1: Attention maps for the head with the best precision (head 1, top) and the head with the best F-score (head 2, bottom) in layer 9 of CodeBERT, for the first 30 tokens of a Python program (the code itself is shown in a separate figure). The head with the best precision mostly encodes next-token attention, while the head with the best F-score encodes more complex relations.
Figure 2: When the model graph is compared with the syntax graph at an attention threshold of 0.3, precision (left) is high but recall (right) is very low.
Figure 3: F-score between the model graph and the syntax graph at different thresholds for all heads. Each curve represents one head. Of the models and layers evaluated, the plots show layers 6 and 12 of CodeBERT and CodeT5. For most heads, the F-score is highest at a threshold of 0.05 for all models.
Figure 4: Recall of model graphs against syntax graphs (top) and data flow graphs (bottom). The plots show that, irrespective of training objective, fine-tuning, or model size, the models encode no more than 40% of syntactic relations and around 55% of data flow relations. Encoder-decoder (Enc-Dec) models encode syntactic relations much better in deeper layers.
Figure 5: Graph edit distance (GED) per node (lower values indicate higher similarity) between the model graph and the data flow graph (DFG), the non-identifier syntax graph, and the complete syntax graph for various models. The gap between the non-identifier and complete syntax graphs shows that introducing syntax-identifier edges reduces similarity drastically; these edges are therefore not present in the model graphs. For very large models (center), even DFG edges are encoded poorly.
...and 6 more figures
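The threshold-based comparison described in the captions for Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the toy attention values, the reference edge set, and the choice to treat edges as undirected are all assumptions made for the example.

```python
def attention_edges(attention, threshold):
    """Build a model graph: keep an undirected edge (i, j), i < j, whenever
    the attention weight between two distinct tokens exceeds the threshold."""
    edges = set()
    for i, row in enumerate(attention):
        for j, weight in enumerate(row):
            if i != j and weight > threshold:
                edges.add((min(i, j), max(i, j)))
    return edges


def precision_recall_f(model_edges, reference_edges):
    """Compare model-graph edges with reference (e.g. syntax-graph) edges."""
    true_pos = len(model_edges & reference_edges)
    precision = true_pos / len(model_edges) if model_edges else 0.0
    recall = true_pos / len(reference_edges) if reference_edges else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Made-up attention map for one head over 4 tokens (each row sums to 1).
attention = [
    [0.55, 0.35, 0.06, 0.04],
    [0.10, 0.50, 0.32, 0.08],
    [0.04, 0.06, 0.70, 0.20],
    [0.07, 0.03, 0.10, 0.80],
]
# Made-up reference edges standing in for a syntax graph.
reference = {(0, 1), (1, 2), (1, 3), (2, 3)}

for threshold in (0.3, 0.05):
    p, r, f = precision_recall_f(attention_edges(attention, threshold), reference)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

With these toy values, the higher threshold keeps only the strongest edges, so precision is high and recall is low, while the lower threshold recovers more reference edges at some cost in precision. This is the kind of sensitivity to a fixed threshold that the captions point to.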
