Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

Aditya Basarkar, Benyamin Tabarsi, Tiffany Barnes, Dongkuan Xu

TL;DR

The paper tackles the challenge of reliable multi-step mathematical reasoning in LLMs by introducing Iteratively Improved Program Construction (IIPC), a dual-branch framework that couples a token-based Chain-of-Thought with an executable program refinement branch. IIPC maintains a memory of past mistakes and iteratively refines a program using execution feedback, while keeping high-level reasoning stable through a separate CoT trace. Empirical results on MATH and AIME across multiple base LLMs show IIPC achieving state-of-the-art or competitive performance, with ablations highlighting the importance of iterative refinement, reflection memory, and dual-branch integration. The work demonstrates that blending symbolic program execution with natural-language reasoning can improve trajectory correction and fault avoidance, albeit with higher token costs and model-capacity requirements.

Abstract

Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced their mathematical reasoning capabilities, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs. All code and implementations are released as open source.

Paper Structure

This paper contains 28 sections, 7 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Overview of IIPC. $f_{\text{init}}$ derives key propositions from the problem statement; $f_{\text{prog}}$ generates an initial candidate program; $f_{\text{val}}$ evaluates program correctness and logical consistency; if errors are detected, the error correction component $f_{\text{err}}$ revises the program accordingly; $f_{\text{cot}}$ produces a textual chain of thought; $f_{\text{comb}}$ combines program and token reasoning context for final output; $M_t$ denotes the error descriptor memory at refinement step $t$; $P_t$ represents the program store at step $t$.
  • Figure 2: Accuracy of PoT, IIPC, CR, and MACM on the MATH dataset using Llama 4 Maverick. The bar graph is stratified by difficulty level.
  • Figure 3: Heatmap of accuracy (%) by subject area on the MATH benchmark for Llama-4-Maverick. Columns correspond to reasoning agents (PoT, IIPC, CR, and MACM) and rows correspond to mathematical domains.
  • Figure 4: Accuracy by difficulty level for each LLM, comparing PoT, CR, MACM, and IIPC agents on the MATH dataset.
  • Figure 5: Heatmaps showing accuracy by topic for each LLM, comparing PoT, CR, MACM, and IIPC agents on the MATH dataset.
  • ...and 2 more figures
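
The control flow named in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' released implementation: the six components are passed in as plain callables whose signatures are assumptions, `IIPCState`, `run_program`, and `max_steps` are hypothetical names, and the toy lambdas at the bottom stand in for what would be LLM calls. It also assumes that $f_{\text{cot}}$ sees only the problem statement, which is one reading of the "separate CoT trace" in the TL;DR.

```python
import contextlib
import io
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class IIPCState:
    """Hypothetical container for the two stores named in Figure 1."""
    programs: List[str] = field(default_factory=list)  # P_t: program store
    memory: List[str] = field(default_factory=list)    # M_t: error descriptors


def run_program(source: str) -> Tuple[Optional[str], Optional[str]]:
    """Execute a candidate program; return (stdout, error message)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(source, {})  # unsandboxed: acceptable only for a toy sketch
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip(), None


def iipc(problem: str, f_init: Callable, f_prog: Callable, f_val: Callable,
         f_err: Callable, f_cot: Callable, f_comb: Callable,
         max_steps: int = 3):
    """Refine a program with execution feedback, then combine it with a CoT."""
    state = IIPCState()
    props = f_init(problem)            # key propositions from the problem
    program = f_prog(problem, props)   # initial candidate program
    output = None
    for step in range(max_steps + 1):
        state.programs.append(program)
        output, error = run_program(program)
        # Runtime errors take priority; otherwise ask the validator.
        issue = error if error is not None else f_val(problem, program, output)
        if issue is None:
            break
        state.memory.append(issue)     # remember the mistake
        if step == max_steps:
            break                      # refinement budget exhausted
        program = f_err(problem, program, state.memory)
    cot = f_cot(problem)               # separate natural-language branch
    return f_comb(problem, cot, program, output), state


# Toy stand-ins for the LLM-backed components (assumptions, not the paper's).
problem = "What is the sum of the integers from 1 to 100?"
answer, state = iipc(
    problem,
    f_init=lambda p: ["sum the integers 1..100 inclusive"],
    f_prog=lambda p, props: "print(sum(range(1, 100)))",  # off-by-one bug
    f_val=lambda p, prog, out: (
        "upper bound 100 is excluded" if "range(1, 100)" in prog else None),
    f_err=lambda p, prog, mem: prog.replace("range(1, 100)", "range(1, 101)"),
    f_cot=lambda p: "Pairing terms gives 50 pairs that each sum to 101.",
    f_comb=lambda p, cot, prog, out: out,
)
print(answer)  # prints 5050
```

In this toy run the first program prints 4950, the validator records one error descriptor, and the revised program is accepted on the second pass, so `state.programs` holds two entries and `state.memory` holds one. Passing the whole memory to `f_err` reflects the caption's description of $M_t$ as a running store of error descriptors; how the real system conditions on that memory is not specified in this summary.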