SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
Sina Bagheri Nezhad, Yao Li, Ameeta Agrawal
TL;DR
The paper addresses the unreliability of prose-based mathematical reasoning in large language models (LLMs) by introducing SymCode, a neurosymbolic framework that converts problems into verifiable Python scripts using SymPy. SymCode treats the code itself as the reasoning trace and adds a self-debugging loop. It achieves higher accuracy on challenging benchmarks such as MATH-500, OlympiadBench, and AIME, significantly improves token efficiency, and shifts failures from opaque arithmetic mistakes to transparent programmatic errors. The approach demonstrates state-of-the-art performance, particularly for models with strong coding fluency, and highlights the value of grounding neural reasoning in symbolic computation for trustworthy formal problem solving. The work suggests extensions to other formal domains (e.g., physics, logic) and outlines future directions for enhanced self-debugging and multi-modal reasoning.
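The self-debugging loop mentioned above can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the names `run_script`, `solve_with_self_debugging`, and `stub_generate`, the convention that a script stores its result in a variable called `answer`, and the retry limit of three are all assumptions made for illustration. The stub stands in for an LLM call so the example runs on its own.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_script(source: str) -> Tuple[Optional[object], Optional[str]]:
    """Execute a candidate script; return (answer, None) or (None, error text)."""
    namespace: dict = {}
    try:
        # Assumption: a real system would run this in a sandbox, not via bare exec.
        exec(source, namespace)
        # Assumed convention: the script leaves its result in `answer`.
        return namespace["answer"], None
    except Exception:
        return None, traceback.format_exc()


def solve_with_self_debugging(
    problem: str,
    generate: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> Optional[object]:
    """Ask for a script, run it, and feed any error text back for a retry."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        source = generate(problem, feedback)
        answer, error = run_script(source)
        if error is None:
            return answer
        feedback = error  # the error message becomes input to the next attempt
    return None


def stub_generate(problem: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for the LLM: fails first, then returns a fixed script."""
    if feedback is None:
        return "answer = 1 / 0"
    return "answer = 2 + 3"


result = solve_with_self_debugging("What is 2 + 3?", stub_generate)
print(result)
```

The point of the sketch is that a failed attempt produces a concrete, inspectable error message rather than a silently wrong number, which is what makes the failure "transparent" in the sense the summary describes.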
Abstract
Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
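To make "verifiable code generation using the SymPy library" concrete, here is a small sketch of what such a script might look like. The problem (a quadratic chosen for illustration), the variable names, and the substitution check are assumptions, not material from the paper. It only illustrates the general idea that each step is exact symbolic computation that can be checked deterministically.

```python
import sympy as sp

# Hypothetical problem: find all real x with x**2 - 5*x + 6 = 0 and report their sum.
x = sp.symbols("x", real=True)
equation = sp.Eq(x**2 - 5 * x + 6, 0)

# Exact symbolic solve; no floating-point arithmetic is involved.
roots = sp.solve(equation, x)

# Deterministic check: substitute every root back into the equation.
for r in roots:
    assert sp.simplify(equation.lhs.subs(x, r) - equation.rhs) == 0

answer = sum(roots)
print(answer)
```

Unlike a prose derivation, every intermediate value here is produced by the symbolic engine, so a wrong answer has to come from a wrong formulation of the problem rather than from a slipped arithmetic step.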
