SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation

Sina Bagheri Nezhad, Yao Li, Ameeta Agrawal

TL;DR

The paper addresses the unreliability of prose-based mathematical reasoning in large language models (LLMs) and introduces SymCode, a neurosymbolic framework that converts problems into verifiable Python scripts using SymPy. By treating the code as the reasoning trace and employing a self-debugging loop, SymCode achieves higher accuracy on challenging benchmarks such as MATH-500, OlympiadBench, and AIME, while significantly improving token efficiency and shifting failures from opaque arithmetic mistakes to transparent programmatic errors. The approach demonstrates state-of-the-art performance, particularly for models with strong coding fluency, and highlights the value of grounding neural reasoning in symbolic computation for trustworthy formal problem solving. The work suggests extensions to other formal domains (e.g., physics, logic) and outlines future directions for enhanced self-debugging and multi-modal reasoning.

Abstract

Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.

Paper Structure

This paper contains 27 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the SymCode framework. A natural language problem is translated into Python code by the LLM, executed, and iteratively refined through error feedback until successful execution or a retry limit is reached.
  • Figure 2: Performance gain of SymCode and SymCode+ over the best prose-based baseline. The advantage of the SymCode framework is most significant on the most difficult datasets (OlympiadBench and AIME).
  • Figure 3: Activation rate of the self-debugging loop. The loop was required most often for Llama 3.2, correlating with its weaker initial coding performance.
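The generate, execute, and refine cycle described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_script`, `solve_with_retries`, `stub_generate`, and the `max_retries` parameter are hypothetical names, and the stub generator stands in for the LLM call. The stub emits plain Python rather than SymPy code only to keep the sketch dependency-free.

```python
import contextlib
import io
import traceback
from typing import Callable, Optional, Tuple


def run_script(source: str) -> Tuple[bool, str]:
    """Execute a script; return (succeeded, stdout text or traceback text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # Fresh globals for every attempt so state does not leak between runs.
            exec(source, {"__name__": "__main__"})
    except Exception:
        # The traceback becomes the error feedback for the next attempt.
        return False, traceback.format_exc()
    return True, buffer.getvalue().strip()


def solve_with_retries(
    problem: str,
    generate: Callable[[str, Optional[str]], str],
    max_retries: int = 3,  # assumed retry limit; the source does not state a value here
) -> Optional[str]:
    """Generate code, run it, and feed errors back until success or the limit."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        source = generate(problem, feedback)
        ok, output = run_script(source)
        if ok:
            return output
        feedback = output
    return None  # retry limit reached without a successful execution


def stub_generate(problem: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for the LLM: buggy first draft, fixed on feedback."""
    if feedback is None:
        return "print(total)"  # fails with NameError: 'total' is undefined
    return "total = sum(range(1, 11))\nprint(total)"


print(solve_with_retries("Sum the integers from 1 to 10.", stub_generate))  # prints 55
```

In this sketch the first draft fails with a `NameError`, the traceback is passed back to the generator, and the second draft runs cleanly. A real system would replace `stub_generate` with a model call and should run generated code in a sandbox rather than with a bare `exec`.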