Code Execution as Grounded Supervision for LLM Reasoning

Dongwon Jung, Wenxuan Zhou, Muhao Chen

TL;DR

The paper tackles the challenge of scalable, reliable chain-of-thought (CoT) supervision by grounding reasoning in verifiable code execution traces and translating them into natural-language CoT for supervised fine-tuning. It builds a two-stage data pipeline: (i) extract verifiable execution traces from open-source Python solutions and (ii) translate these traces into fluent CoT rationales, producing a high-quality training dataset. Experiments across math, coding, and reasoning benchmarks show that models trained with this grounded supervision outperform baselines and use fewer tokens at inference, because they repeat themselves less and overthink less. The approach scales without human annotation and yields reasoning improvements that transfer across diverse domains, though it is best suited to tasks that can be expressed as executable code.

Abstract

Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural-language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by cutting meaningless repetition and overthinking.
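
To make the idea of a "verifiable, step-by-step reasoning trace" concrete, the sketch below records the local variables of a small Python function before each executed line, using the standard-library hook `sys.settrace`. It illustrates only the first stage (trace extraction) and is an assumption-laden sketch: the helper `trace_execution` and the toy problem `sum_of_squares` are hypothetical names introduced here, and the text above does not say which tracing tool the authors actually use.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record one step per executed line.

    Hypothetical helper for illustration; not the paper's implementation.
    Each line step is (line offset from the 'def' line, snapshot of locals
    taken just before that line runs). The final step is ("return", value).
    """
    steps = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        # Ignore every frame except the function under study.
        if frame.f_code is not target_code:
            return None
        if event == "line":
            offset = frame.f_lineno - target_code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        elif event == "return":
            steps.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever tracer was active before.
        sys.settrace(previous)
    return result, steps


def sum_of_squares(n):
    # Toy "solution program" standing in for an open-source Python solution.
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total


result, steps = trace_execution(sum_of_squares, 3)
for step in steps:
    print(step)
```

Because every recorded value comes from actually running the program, each step can be checked against the code, which is the property the abstract relies on. Turning such a trace into fluent natural-language CoT is the second stage and is not sketched here.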

Paper Structure

This paper contains 25 sections, 3 equations, 1 figure, 10 tables.

Figures (1)

  • Figure 1: An overview of our method. The translated execution trace is grounded in code execution, making it a reliable and accurate source of reasoning supervision for the LLM.