What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces

Jordi Armengol-Estapé; Quentin Carbonneaux; Tianjun Zhang; Aram H. Markosyan; Volker Seeker; Chris Cummins; Melanie Kambadur; Michael F. P. O'Boyle; Sida Wang; Gabriel Synnaeve; Hugh James Leather

TL;DR

Code LLMs typically treat code as static text and rarely exploit execution traces. This paper studies Execution Tuning (E.T.), a training procedure that explicitly models real program traces at multiple granularities, using a Python tracer and synthetic inputs to build trace datasets at scale. It compares three scratchpad strategies (Scratchpad, Compact Scratchpad, Dynamic Scratchpad) and shows that trace-based training improves output prediction on CruxEval and MBPP, with dynamic scratchpads delivering the strongest gains on long-running or complex executions. Downstream gains on traditional coding benchmarks are modest, suggesting that E.T.'s main impact may lie in program-state understanding and debugging-oriented tasks. The work sets the stage for broader trace-based reasoning across languages and dynamic execution contexts.

Abstract

Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about their execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line and instruction-level) and strategies on the task of output prediction, obtaining around 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.
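To make the notion of a line-level execution trace concrete, the following is a minimal sketch built on Python's standard `sys.settrace` hook. It is not the tracer used in the paper: the helper `trace_lines`, the example function `running_sum`, and the `(relative line, locals)` step format are all hypothetical names and choices made for illustration. The last two assignments only loosely mirror the distinction drawn in the abstract between an accumulated history and a single, self-contained state that is updated in place.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args), recording (relative line number, locals snapshot)
    just before each line of func executes. Illustrative helper only."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # skip frames that belong to other functions
        if event == "line":
            rel_line = frame.f_lineno - code.co_firstlineno
            steps.append((rel_line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, steps = trace_lines(running_sum, 3)

# Assumption for illustration: a history-style scratchpad keeps every step,
# while a dynamic-style scratchpad keeps only the most recent state.
history_scratchpad = steps
dynamic_scratchpad = steps[-1]
```

In this sketch the history grows with the number of executed lines, whereas the single latest state stays roughly constant in size, which gives some intuition for why a self-contained, updated state could matter on executions with thousands of steps.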
