Code Simulation Challenges for Large Language Models

Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge

TL;DR

The paper probes how well large language models can simulate code execution to reveal algorithmic reasoning limits, introducing six benchmarks that span straight-line, critical-path, approximate, redundant, nested-loop, and sorting tasks. It analyzes model performance across multiple architectures and prompts, revealing fragility and memorisation as core bottlenecks. A novel prompting method, Chain of Simulation (CoSm), is proposed to elicit line-by-line execution traces and reduce reliance on pattern matching, improving simulation in many cases. The findings highlight both the capabilities and the gaps of current LLMs as digital computational models and propose CoSm as a practical step toward more reliable routine simulation. Overall, the work advances benchmarking for bare-bones LLMs and presents CoSm as a transferable technique with potential impact on broader algorithmic reasoning tasks.

Abstract

Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition to solve them correctly. This work studies to what extent Large Language Models (LLMs) can simulate coding and algorithmic tasks to provide insights into general capabilities in such algorithmic reasoning tasks. We introduce benchmarks for straight-line programs, code that contains critical paths, and approximate and redundant instructions. We further assess the simulation capabilities of LLMs with sorting algorithms and nested loops and show that a routine's computational complexity directly affects an LLM's ability to simulate its execution. While the most powerful LLMs exhibit relatively strong simulation capabilities, the process is fragile, seems to rely heavily on pattern recognition, and is affected by memorisation. We propose a novel off-the-shelf prompting method, Chain of Simulation (CoSm), which instructs LLMs to simulate code execution line by line, following the computation pattern of compilers. CoSm efficiently helps LLMs reduce memorisation and shallow pattern recognition while improving simulation performance. We consider the success of CoSm in code simulation to be inspirational for other general routine simulation reasoning tasks.
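
To make the notion of simulating a straight-line program concrete, the sketch below runs a short program of add, sub, and mov instructions and records the variable state after every line, which gives the ground truth against which a model's line-by-line simulation could be compared. The tuple encoding of instructions, the function names, and the prompt wording are all assumptions made for illustration; in particular, the prompt is not the CoSm prompt from the paper.

```python
# Illustrative sketch only: the instruction encoding, the function names and
# the prompt text are hypothetical and are not taken from the paper.

def run_straight_line(program, initial_state):
    """Execute (op, dst, src) instructions in order, with no branches or loops.

    Returns the final state and a trace of (instruction text, state snapshot).
    """
    state = dict(initial_state)  # copy so the caller's dict is not mutated
    trace = []
    for op, dst, src in program:
        if op == "mov":
            state[dst] = state[src]
        elif op == "add":
            state[dst] = state[dst] + state[src]
        elif op == "sub":
            state[dst] = state[dst] - state[src]
        else:
            raise ValueError(f"unsupported instruction: {op}")
        trace.append((f"{op} {dst}, {src}", dict(state)))
    return state, trace


def build_line_by_line_prompt(program, initial_state):
    """Build a prompt asking for the state after each line (hypothetical wording)."""
    header = "Simulate the program below line by line. After each line, report the value of every variable."
    init = "Initial state: " + ", ".join(f"{k}={v}" for k, v in sorted(initial_state.items()))
    body = "\n".join(f"{op} {dst}, {src}" for op, dst, src in program)
    return f"{header}\n{init}\n{body}"


program = [("mov", "a", "b"), ("add", "a", "c"), ("sub", "c", "a")]
start = {"a": 1, "b": 2, "c": 3}
final_state, trace = run_straight_line(program, start)
prompt = build_line_by_line_prompt(program, start)
```

Keeping a full trace rather than only the final state makes it possible to locate the first line at which a simulated execution diverges from the reference.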

Paper Structure (38 sections, 22 figures, 1 table)

Figures (22)

  • Figure 1: Left: an example of the naturalistic vs. synthetic goods-exchange settings. The former describes, in natural language, two agents who exchange goods; the latter is an equivalent formulation in code. GPT-3.5-Turbo's performance on the two tasks is correlated (right), but it performs better on the synthetic task (a "simulation gap"). We conduct experiments on $30$ samples per instruction class with $\{10, 20, 30, 40, 50\}$ interactions/lines of code.
  • Figure 2: Accuracy over $3$ independent runs of $30$ experiments each for different LLMs on code snippets containing only {and, or}, {add, sub}, or {mov} instructions. Results are grouped by the number of instructions per snippet (x-axis), namely $\{1, 10, 30\}$.
  • Figure 3: Accuracy and Mean Absolute Error of different LLMs on code of varying length with only {add, sub} and {mov} instructions (over $3$ independent runs of $30$ experiments each).
  • Figure 4: Straight-line code.
  • Figure 5: Code with critical path.
  • ...and 17 more figures
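
The captions above report accuracy and Mean Absolute Error. The sketch below shows one standard way to compute both from a list of predicted and reference outputs. It assumes exact-match accuracy and numeric outputs, and the function names and sample values are made up for illustration; the paper's evaluation code may differ.

```python
# Illustrative sketch only: standard definitions of the two metrics, with
# made-up sample values. This is not the paper's evaluation code.

def accuracy(predictions, targets):
    """Fraction of predictions that match their target exactly."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def mean_absolute_error(predictions, targets):
    """Average absolute difference between numeric predictions and targets."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


predicted = [5, 7, -2, 10]   # hypothetical model outputs
reference = [5, 8, -2, 6]    # hypothetical ground-truth results
acc = accuracy(predicted, reference)
mae = mean_absolute_error(predicted, reference)
```

Reporting both is informative because accuracy treats a near miss and a wildly wrong answer alike, whereas the absolute error separates them.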