Chain of Execution Supervision Promotes General Reasoning in Large Language Models

Nuo Chen; Zehua Li; Keqin Bao; Junyang Lin; Dayiheng Liu

TL;DR

TracePile introduces a large-scale, step-by-step Chain of Execution corpus that converts code execution into explicit reasoning traces to boost general reasoning across mathematics, algorithms, and programming tasks. By combining multi-source data, diverse augmentation, and structured CoE generation, TracePile provides a richer supervision signal than traditional final-answer training. Empirical results across multiple base models and 20 benchmarks show consistent gains in both in-domain and out-of-domain settings, with notable improvements in multi-step, state-tracking, and algorithmic reasoning. The work highlights how explicit execution traces can improve transferability and robustness, while also detailing limitations and directions for broader coverage and longer CoE traces.

Abstract

Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans mathematics, classical algorithms, and algorithmic competitions, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
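To make the idea of turning code execution into an explicit step-by-step trace concrete, the following is a minimal sketch in Python. It is an illustration only and not the TracePile generation pipeline, which this section does not specify: the helper names `trace_execution` and `render_trace`, the `gcd` example, and the plain-text step format are all assumptions introduced here. The sketch records the local variables at each executed line using the standard `sys.settrace` hook and prints them as numbered steps followed by the final answer.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record a snapshot of its locals at each executed line.

    Hypothetical helper for illustration; not part of TracePile.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame of the function under study.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was active before.
        sys.settrace(previous)
    return result, steps


def gcd(a, b):
    while b:
        a, b = b, a % b
    return a


def render_trace(steps):
    """Format recorded snapshots as numbered, human-readable steps."""
    lines = []
    for i, (offset, local_vars) in enumerate(steps, start=1):
        state = ", ".join(f"{k}={v!r}" for k, v in sorted(local_vars.items()))
        lines.append(f"Step {i}: source line +{offset}, state: {state}")
    return "\n".join(lines)


result, steps = trace_execution(gcd, 48, 18)
print(render_trace(steps))
print("Final answer:", result)
```

Each recorded step pairs a source line with the variable state observed just before that line runs, which is the kind of intermediate information a final-answer-only training target omits. A variable-tracing question in the spirit of the abstract could then ask for the value of `a` at a given step.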
