Assessing Code Understanding in LLMs

Cosimo Laneve, Alvise Spanò, Dalila Ressi, Sabina Rossi, Michele Bugliesi

TL;DR

This paper assesses Large Language Models on their ability to recognize semantic equivalence in Python code after copy propagation and constant folding, two classic compiler optimizations. By formalizing these transformations and constructing a benchmark of correct and perturbed variants, the study reveals substantial zero-shot and few-shot deficits in code understanding, with contextual prompts offering only partial improvement. The findings advocate integrating LLMs with automatic code-optimization tools to enable self-supervised training and robust pre-processing, aiming to enhance semantic reasoning over code. The work highlights practical implications for deploying LLMs in programming tasks and outlines concrete avenues for future dataset expansion and model-algorithm interactions.

Abstract

We present an empirical evaluation of Large Language Models in code understanding associated with non-trivial, semantic-preserving program transformations such as copy propagation or constant folding. Our findings show that LLMs fail to judge semantic equivalence in approximately 41% of cases when no context is provided and in 29% when given a simple generic context. To improve accuracy, we advocate integrating LLMs with code-optimization tools to enhance training and facilitate more robust program understanding.
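For readers unfamiliar with the two transformations named above, the following is a minimal illustrative sketch in Python. It is not taken from the paper's benchmark: the function names, the example program, and the `agree` helper are all hypothetical. It shows one program, a copy-propagated variant, a constant-folded variant, and a perturbed variant that looks similar but is not equivalent.

```python
def original(x):
    a = x            # `a` is a plain copy of `x`
    b = a + 2 * 3    # `2 * 3` is a constant subexpression
    c = a            # another copy
    return b * c


def after_copy_propagation(x):
    # Copy propagation: uses of the copies `a` and `c` are replaced by `x`.
    b = x + 2 * 3
    return b * x


def after_constant_folding(x):
    # Constant folding: `2 * 3` is evaluated ahead of time to `6`.
    b = x + 6
    return b * x


def perturbed(x):
    # Hypothetical incorrect variant: the constant was folded wrongly.
    b = x + 5
    return b * x


def agree(f, g, inputs):
    """Return True if f and g give the same result on every sample input.

    Agreement on samples is only evidence of equivalence, not a proof.
    """
    return all(f(v) == g(v) for v in inputs)


inputs = range(-5, 6)
print(agree(original, after_copy_propagation, inputs))
print(agree(original, after_constant_folding, inputs))
print(agree(original, perturbed, inputs))
```

The first two variants compute the same function as `original`, while `perturbed` differs on every nonzero input. Distinguishing such pairs is the kind of equivalence judgment the evaluation asks of an LLM.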

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Annotations and transformations of copy propagation.
  • Figure 2: Annotations and transformations of constant folding.
  • Figure 3: Difference between the preamble of the contextless and contextual prompts.