Neuro-Symbolic Data Generation for Math Reasoning

Zenan Li; Zhi Zhou; Yuan Yao; Yu-Feng Li; Chun Cao; Fan Yang; Xian Zhang; Xiaoxing Ma

TL;DR

The paper investigates whether gaps in LLM mathematical reasoning are intrinsic or data-driven, and proposes a neuro-symbolic data-generation pipeline that mutates math problems in symbolic form and then informalizes them into natural-language questions. Problems are formalized in SMT-LIB, mutated in a controlled way to make them simpler or more complex, and diversified with projected MCMC sampling. Symbolic solvers verify the validity of the resulting problems, which are then paired with GPT-4-generated reasoning paths to form large datasets. Fine-tuning multiple models (e.g., LLaMA-2, Mistral) on the generated data yields state-of-the-art results on GSM8K and MATH, with robust performance on out-of-domain benchmarks and evidence of scalability. The approach demonstrates how combining symbolic rigor with neural flexibility can address data scarcity and enhance math reasoning in LLMs. The paper also outlines limitations and future directions concerning mutation expressiveness and solver coverage.

Abstract

A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework that combines the intuitive informalization strengths of LLMs with the precise symbolic reasoning of math solvers, along with projected Markov chain Monte Carlo sampling in the highly irregular symbolic space. Empirical experiments demonstrate the high quality of the data generated by the proposed method, and show that LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.
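
As a rough illustration of the mutate-then-verify idea described above, the following toy sketch perturbs the integer coefficients of a single linear equation, projects the perturbed values back onto integers by rounding, and keeps a candidate only if an exact check confirms a positive integer answer. It is not the authors' implementation: the function names (`to_smtlib`, `solve`, `is_valid`, `mutate`), the restriction to problems of the form a*x + b = c, and the use of exact rational arithmetic in place of a real symbolic solver are all assumptions made for this example. The rounding step is only a loose analogue of projected MCMC, and the informalization and reasoning-path steps are omitted.

```python
import random
from fractions import Fraction


def to_smtlib(a, b, c):
    """Render a*x + b = c as SMT-LIB text (non-negative integers only)."""
    return (
        "(declare-const x Int)\n"
        f"(assert (= (+ (* {a} x) {b}) {c}))\n"
        "(check-sat)\n"
        "(get-model)"
    )


def solve(a, b, c):
    """Exact solution of a*x + b = c (hypothetical stand-in for a solver call)."""
    return Fraction(c - b, a)


def is_valid(a, b, c):
    """Accept only problems whose answer is a positive integer."""
    if a < 1 or b < 0 or c < 0:
        return False
    x = solve(a, b, c)
    return x.denominator == 1 and x > 0


def mutate(problem, rng, step=3.0, max_tries=1000):
    """Add Gaussian noise to each coefficient, round to integers, retry until valid."""
    for _ in range(max_tries):
        cand = tuple(round(v + rng.gauss(0.0, step)) for v in problem)
        if cand != problem and is_valid(*cand):
            return cand
    return problem  # fall back to the seed problem if nothing valid was found


if __name__ == "__main__":
    seed_problem = (3, 4, 19)  # 3x + 4 = 19, so x = 5
    new_problem = mutate(seed_problem, random.Random(0))
    print(to_smtlib(*new_problem))
    print("answer:", solve(*new_problem))
```

In this sketch, rejected candidates are simply discarded and resampled, so every problem that is returned has been checked exactly. A fuller pipeline would hand the SMT-LIB text to an actual solver and pass the verified problem to an LLM for informalization.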
