Assessing Code Understanding in LLMs
Cosimo Laneve, Alvise Spanò, Dalila Ressi, Sabina Rossi, Michele Bugliesi
TL;DR
This paper assesses Large Language Models on their ability to recognize semantic equivalence in Python code after copy propagation and constant folding, two classic compiler optimizations. By formalizing these transformations and constructing a benchmark of correct and perturbed variants, the study reveals substantial zero-shot and few-shot deficits in code understanding, with contextual prompts offering only partial improvement. The findings advocate integrating LLMs with automatic code-optimization tools to enable self-supervised training and robust pre-processing, aiming to enhance semantic reasoning over code. The work highlights practical implications for deploying LLMs in programming tasks and outlines concrete avenues for future dataset expansion and model-algorithm interactions.
Abstract
We present an empirical evaluation of how well Large Language Models understand code under non-trivial, semantics-preserving program transformations such as copy propagation and constant folding. Our findings show that LLMs fail to judge semantic equivalence in approximately 41% of cases when no context is provided and in 29% of cases when given a simple generic context. To improve accuracy, we advocate integrating LLMs with code-optimization tools to enhance training and facilitate more robust program understanding.
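To make the two transformations concrete, the following is a minimal illustrative sketch and is not taken from the paper's benchmark. The function names (`original`, `copy_propagated`, `constant_folded`, `perturbed`, `agree_on`) and the sample inputs are assumptions chosen for illustration. It shows a small Python function, its copy-propagated and constant-folded variants, and a perturbed variant that looks like a valid fold but changes the result.

```python
# Illustrative only: hypothetical functions, not items from the paper's benchmark.

def original(a, b):
    x = a          # x is a plain copy of a
    y = x + b      # use of the copy
    k = 2 * 3      # constant expression
    return y * k

def copy_propagated(a, b):
    # Copy propagation: the use of x is replaced by a, so the copy disappears.
    y = a + b
    k = 2 * 3
    return y * k

def constant_folded(a, b):
    # Constant folding: 2 * 3 is evaluated ahead of time to 6.
    y = a + b
    return y * 6

def perturbed(a, b):
    # Incorrect "fold" (6 replaced by 5): not semantically equivalent.
    y = a + b
    return y * 5

def agree_on(f, g, inputs):
    """Return True if f and g return the same value on every sample input.

    Agreement on samples is only evidence of equivalence, not a proof.
    """
    return all(f(a, b) == g(a, b) for a, b in inputs)

# Small grid of integer test inputs (an arbitrary choice for this sketch).
SAMPLE_INPUTS = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]

if __name__ == "__main__":
    for variant in (copy_propagated, constant_folded, perturbed):
        print(variant.__name__, agree_on(original, variant, SAMPLE_INPUTS))
```

The first two variants agree with `original` on every sample, while `perturbed` differs whenever `a + b` is nonzero. The equivalence judgment studied here asks a model to draw this distinction from the source text alone, without running the code.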
