Self-Execution Simulation Improves Coding Models

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

Abstract

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces (textual explanations grounded in true execution) with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
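
To make the self-verification idea concrete, the following minimal Python sketch ranks candidate solutions by how often the model's own simulated outputs on the public tests match the expected outputs (best@k). The `LLM` callable and the prompts are illustrative placeholders, not the paper's implementation or API.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the fine-tuned Code LLM: takes a prompt string and
# returns generated text. Any completion client could be dropped in here.
LLM = Callable[[str], str]

def simulate_output(llm: LLM, code: str, test_input: str) -> str:
    """Ask the model to *simulate* running `code` on `test_input` and return
    only its predicted output (the output-prediction task)."""
    prompt = (
        "Simulate the execution of the following program step by step, "
        "then state only the final output.\n\n"
        f"Program:\n{code}\n\nInput:\n{test_input}\n\nFinal output:"
    )
    return llm(prompt).strip()

def best_at_k(
    llm: LLM,
    candidates: List[str],
    public_tests: List[Tuple[str, str]],  # (input, expected_output) pairs
) -> str:
    """Best@k selection: score each candidate by how many public tests its
    self-simulated output agrees with, and return the highest-scoring one."""
    def score(code: str) -> int:
        return sum(
            simulate_output(llm, code, inp) == expected.strip()
            for inp, expected in public_tests
        )
    return max(candidates, key=score)
```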

Paper Structure

This paper contains 36 sections, 1 equation, 9 figures, 9 tables, and 1 algorithm.

Figures (9)

  • Figure 1: A conceptual outline of how one can use self-execution simulation of a generated code solution (or solutions) on public or generated test cases to improve coding performance. The simulation can be used as feedback to select the best solution from a few candidates (best@k) or to iteratively fix the code as needed (self-RLEF); a minimal code sketch of this fixing loop appears after this list. See Section \ref{sec:solving_with_sim} for details.
  • Figure 2: The two parts of our training pipeline: 1) supervised fine-tuning on natural language execution traces (NLEX), 2) multi-task reinforcement learning on output prediction and competitive programming (optionally with multi-turn feedback and fixing).
  • Figure 3: CruxEval-O performance compared to model active parameters. Arrows demonstrate the benefit from training on NLEX data. We also compare to open models.
  • Figure 4: Best@k performance of self-verification with self-simulation. Solutions and output predictions are produced by the same model - based on Qwen2.5-7B or CWM, trained for both solving and output prediction. Even though the tests used for filtering are in the solve prompt, there is still room for notable gains from simulating them.
  • Figure 5: Comparing best@k when ranking Qwen3-32B solutions, using CWM post-trained only for output prediction as a verifier.
  • ...and 4 more figures
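
As referenced in the Figure 1 caption, here is a minimal, self-contained sketch of the iterative self-fixing loop: the model simulates the public tests on its own candidate and, if the simulated outputs disagree with the expected ones, is prompted to revise the code using that feedback. The `LLM` interface, prompts, and round limit are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical text-in / text-out model interface

def self_fix_loop(
    llm: LLM,
    task: str,
    public_tests: List[Tuple[str, str]],  # (input, expected_output) pairs
    max_rounds: int = 3,
) -> str:
    """Draft a solution, self-simulate it on the public tests, and if any
    simulated output disagrees with the expected one, feed that feedback
    back to the model and ask for a fixed version."""
    code = llm(f"Write a program solving this task:\n{task}\n")
    for _ in range(max_rounds):
        feedback = []
        for inp, expected in public_tests:
            predicted = llm(
                "Simulate this program on the given input and state only "
                f"its final output.\n\nProgram:\n{code}\n\nInput:\n{inp}\n"
            ).strip()
            if predicted != expected.strip():
                feedback.append(
                    f"On input {inp!r} the simulated output was {predicted!r}, "
                    f"but {expected!r} was expected."
                )
        if not feedback:  # all simulated tests agree: accept the candidate
            return code
        code = llm(
            f"Task:\n{task}\n\nCurrent solution:\n{code}\n\n"
            "Simulated test feedback:\n" + "\n".join(feedback) +
            "\n\nRevise the program to fix these issues. Return only code."
        )
    return code
```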