What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces

Jordi Armengol-Estapé, Quentin Carbonneaux, Tianjun Zhang, Aram H. Markosyan, Volker Seeker, Chris Cummins, Melanie Kambadur, Michael F. P. O'Boyle, Sida Wang, Gabriel Synnaeve, Hugh James Leather

TL;DR

Code LLMs typically treat code as static text and rarely leverage execution traces. The paper studies Execution Tuning (E.T.), a training procedure that explicitly models real program traces at multiple granularities, using a Python tracer and synthetic inputs to build trace datasets at scale. It compares three scratchpad strategies (Scratchpad, Compact Scratchpad, Dynamic Scratchpad) and shows that trace-based training improves output prediction on CruxEval and MBPP, with dynamic scratchpads delivering the strongest gains on very long executions. Downstream gains on traditional coding benchmarks are modest, suggesting that E.T.'s main impact may lie in program-state understanding and debugging-oriented tasks. The work sets the stage for broader trace-based reasoning across languages and dynamic execution contexts.

Abstract

Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line and instruction-level) and strategies on the task of output prediction, obtaining around 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.
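To make "line-level execution trace" concrete, the following is a minimal sketch of how such a trace can be collected with Python's standard `sys.settrace` hook. It is not the paper's custom tracer: the names `trace_lines` and `running_sum` are hypothetical, and the trace format (line offset plus a snapshot of local variables) is an assumption chosen for illustration.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record (line offset, locals) before each line executes.

    Illustrative only: the record format is an assumption, not the paper's.
    """
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the frame of the function under study.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was installed before.
        sys.settrace(previous)
    return result, events


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, events = trace_lines(running_sum, 3)
for offset, local_vars in events:
    print(offset, local_vars)
```

Each recorded event pairs a source line with the program state just before that line runs, which is the kind of intermediate information a model trained on traces would be asked to predict.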

Paper Structure

This paper contains 19 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Given a natural number, a function returns the number of iterations required to arrive at 1 when following the sequence in the Collatz conjecture. Can we predict the output of such a function for large inputs (3038 in our example) using LLMs? Asking an LLM to directly predict the output results in a plausible but incorrect answer. Training a model to predict the intermediate traces of the function as a scratchpad of intermediate computations generally yields more accurate output predictions, but can be impractical or even inaccurate with long executions. In this work, we introduce dynamic scratchpads, in which the model updates a single, self-contained scratchpad instance, yielding more accurate predictions for long executions.
  • Figure 2: Overview of the data pipeline in E.T. We start from Python functions made executable with synthetic yet representative inputs generated by a combination of LLMs and fuzzing, filtered by test quality. Our custom Python tracer generates a structured dataset of traces. From this dataset, we train models prompted with different trace representations.
  • Figure 3: Prompt for Instruction-1.
  • Figure 4: Plot showing individual state prediction accuracy when increasing N lines into the future, compared to the predictions' negative log-likelihood (NLL). For Return, in this plot only and unlike in the rest of the article, we mean return statement accuracy, not full execution accuracy. Accuracy (bars) decreases as the number of steps into the future increases, and confidence decreases as well (i.e., NLL increases).
  • Figure 5: Plot showing individual state prediction performance when increasing N instructions into the future, compared to the predictions' NLL. NLL standard deviation omitted for clarity.
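The Collatz example from Figure 1 can be sketched in a few lines of Python, together with the difference between an accumulated scratchpad and a dynamic one. This is an illustration of the data shapes only: in the paper the model predicts the intermediate states, whereas here ordinary Python computes them, and the function names (`collatz_steps`, `full_scratchpad`, `dynamic_scratchpad`) and state layout are assumptions.

```python
def collatz_steps(n):
    """Number of iterations needed to reach 1 following the Collatz sequence."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def full_scratchpad(n):
    """Accumulate every intermediate state; the history grows with execution length."""
    steps = 0
    history = [(n, steps)]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
        history.append((n, steps))
    return history


def dynamic_scratchpad(n):
    """Keep one self-contained state that is overwritten at every step."""
    state = {"n": n, "steps": 0}
    while state["n"] != 1:
        current = state["n"]
        next_n = current // 2 if current % 2 == 0 else 3 * current + 1
        # Replace the state instead of appending to a history.
        state = {"n": next_n, "steps": state["steps"] + 1}
    return state


history = full_scratchpad(3038)
final_state = dynamic_scratchpad(3038)
print(len(history), final_state)
```

The accumulated history has one entry per step, so its size scales with the length of the execution, while the dynamic variant holds a constant-size state regardless of how long the program runs. This is the property the caption points to when it describes long executions as impractical for a plain scratchpad.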