ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning

Lingxiao Tang, He Ye, Zhaoyang Chu, Muyang Ye, Zhongxin Liu, Xiaoxue Ren, Lingfeng Bao

Abstract

Code LLMs still struggle with code execution reasoning, especially in smaller models. Existing methods rely on supervised fine-tuning (SFT) with teacher-generated explanations, primarily in two forms: (1) input-output (I/O) prediction chains and (2) natural-language descriptions of execution traces. However, intermediate execution steps cannot be explicitly verified during SFT, so the training objective can reduce to merely matching teacher explanations. Moreover, training data is typically collected without explicit control over task difficulty. We introduce ExecVerify, which goes beyond text imitation by incorporating verifiable white-box rewards derived from execution traces, including next-statement prediction and variable value/type prediction. Our work first builds a dataset with multiple difficulty levels via constraint-based program synthesis. Then, we apply reinforcement learning (RL) to reward correct answers about both intermediate execution steps and final outputs, aligning the training objective with semantic correctness at each execution step. Finally, we adopt a two-stage training pipeline that first enhances execution reasoning and then transfers to code generation. Experiments demonstrate that a 7B model trained with ExecVerify achieves performance comparable to 32B models on code reasoning benchmarks and improves pass@1 by up to 5.9% on code generation tasks over strong post-training baselines.
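
To make the idea of verifiable white-box rewards concrete, the following minimal Python sketch records interpreter-level execution steps with the standard `sys.settrace` hook and checks a model's answers to next-statement and variable value/type questions against them. The helper names (`trace_steps`, `next_statement_reward`, `variable_reward`), the binary 0/1 scoring, and the toy function `example` are illustrative assumptions, not the paper's implementation or its reward definition.

```python
import sys


def trace_steps(func, *args):
    """Run func(*args); record (line offset, locals snapshot) before each line runs.

    Line offsets are relative to the `def` line of `func`. Each local is stored
    as (type name, repr) so later mutation cannot change the snapshot.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace into callees or unrelated frames
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            snapshot = {k: (type(v).__name__, repr(v)) for k, v in frame.f_locals.items()}
            steps.append((offset, snapshot))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return steps, result


def next_statement_reward(steps, t, predicted_offset):
    """1.0 if the prediction names the line that runs right after step t (assumed scoring)."""
    return float(t + 1 < len(steps) and steps[t + 1][0] == predicted_offset)


def variable_reward(steps, t, name, predicted_type, predicted_repr):
    """1.0 if both type and value of `name` just before step t are right (assumed scoring)."""
    return float(steps[t][1].get(name) == (predicted_type, predicted_repr))


def example(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total


steps, result = trace_steps(example, 3)

# Step 0 is `total = 0` (offset 1); the statement executed next is the `for` line (offset 2).
print(next_statement_reward(steps, 0, 2))              # 1.0
# Just before the `for` line first runs, `total` is the int 0.
print(variable_reward(steps, 1, "total", "int", "0"))  # 1.0
```

Because every answer is compared against what the interpreter actually did, a reward of this kind can be checked mechanically at each step, unlike token-level agreement with a teacher explanation.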

Paper Structure

This paper contains 70 sections, 4 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: Comparison between SFT and white-box RL. (a) Code snippet. (b) Execution steps extracted from the interpreter, with the relevant parts highlighted in yellow. (c) SFT optimizes the cross-entropy loss over the entire sequence, without explicitly verifying execution details like variable values or control flow. (d) In contrast, white-box RL leverages interpreter-provided execution steps to assign verifiable and step-level rewards.
  • Figure 2: Overview of our approach. Step 1 constructs a constraint-based dataset of executable Python snippets. Step 2 performs two-stage post-training: white-box RL for code reasoning followed by RL for code generation.
  • Figure 3: Data efficiency comparison at a fixed training scale (15K examples). We report pass@1 on CRUXEval-O and LiveCodeBench-Exec for models fine-tuned with different datasets.
  • Figure 4: Ablation study of our synthesis pipeline on CRUXEval-O and LiveCodeBench-Exec. We report pass@1 for models fine-tuned with different data synthesis variants.
  • Figure 5: CRUXEval-X Multilingual I/O Prediction: Comparison with Qwen2.5-Coder-Instruct (7B/32B).
  • ...and 14 more figures