Neuro-Symbolic Data Generation for Math Reasoning

Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, Xiaoxing Ma

TL;DR

The paper investigates whether gaps in LLM mathematical reasoning are intrinsic or data-driven and proposes a neuro-symbolic data-generation pipeline that mutates symbolic math problems and then informalizes them into natural-language questions. By formalizing problems in SMT-LIB, applying controlled simplification and complication mutations, and using projected MCMC for diversity, the method creates large, valid datasets that are verified by symbolic solvers and paired with GPT-4-generated reasoning paths. Fine-tuning multiple models (e.g., LLaMA-2, Mistral) on the generated data yields state-of-the-art results on GSM8K and MATH, with robust performance on out-of-domain benchmarks and evidence of scalability. The approach demonstrates how combining symbolic rigor with neural flexibility can address data scarcity and enhance math reasoning in LLMs, and the paper outlines limitations and future directions for broader mutation expressiveness and solver coverage.

Abstract

A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both the diversity and the validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework that combines the intuitive informalization strengths of LLMs with the precise symbolic reasoning of math solvers, along with projected Markov chain Monte Carlo sampling in the highly irregular symbolic space. Empirical experiments demonstrate the high quality of the data generated by the proposed method, and show that LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.
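
To make the mutate-then-validate idea concrete, here is a minimal Python sketch of the general pattern on a toy problem family (solve a*x + b = c for x). It is not the paper's implementation. The `LinearProblem` class, the pen-and-box template, and every function name are hypothetical. Exact fraction arithmetic stands in for a symbolic solver, a fixed template stands in for LLM informalization, and the mutate-and-project random walk is a simplification that omits the acceptance step of a real projected MCMC sampler. The `to_smtlib` method only renders SMT-LIB text and does not call a solver.

```python
import random
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class LinearProblem:
    """Toy symbolic problem (hypothetical): find x such that a*x + b == c."""
    a: int
    b: int
    c: int

    def solve(self) -> Fraction:
        # Exact arithmetic stands in for a symbolic solver.
        return Fraction(self.c - self.b, self.a)

    def to_smtlib(self) -> str:
        # Render the constraint as SMT-LIB text; no solver is invoked here.
        return (
            "(declare-const x Int)\n"
            f"(assert (= (+ (* {self.a} x) {self.b}) {self.c}))\n"
            "(check-sat)\n(get-model)"
        )


def is_valid(p: LinearProblem) -> bool:
    # Validity in this toy setting: a positive integer answer exists.
    if p.a <= 0 or p.b < 0:
        return False
    x = p.solve()
    return x.denominator == 1 and x > 0


def mutate(p: LinearProblem, rng: random.Random) -> LinearProblem:
    # Random-walk proposal in coefficient space; the result may be invalid.
    return LinearProblem(
        p.a + rng.randint(-2, 2),
        p.b + rng.randint(-3, 3),
        p.c + rng.randint(-5, 5),
    )


def project(p: LinearProblem) -> LinearProblem:
    # Map an arbitrary proposal onto a nearby valid problem by clamping
    # a and b, rounding the answer, and recomputing c to match.
    a = max(1, p.a)
    b = max(0, p.b)
    x = max(1, round(Fraction(p.c - b, a)))
    return LinearProblem(a, b, a * x + b)


def informalize(p: LinearProblem) -> str:
    # Fixed template standing in for LLM-based informalization.
    return (
        f"Each box holds {p.a} pens and {p.b} pens are loose. "
        f"There are {p.c} pens in total. How many boxes are there?"
    )


def generate(seed: LinearProblem, n: int, rng_seed: int = 0) -> list:
    # Chain of mutate -> project steps; every emitted variant is re-checked.
    rng = random.Random(rng_seed)
    current, variants = seed, []
    for _ in range(n):
        current = project(mutate(current, rng))
        assert is_valid(current)
        variants.append(current)
    return variants
```

Calling `generate(LinearProblem(3, 4, 19), 20)` returns twenty variants, each of which passes `is_valid`, and `informalize` turns any of them into a word problem. The projection step is what keeps the random walk inside the valid region, which is the role the abstract assigns to projection in the symbolic space.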

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The overview of our neuro-symbolic data generation framework. The framework comprises three steps: (1) Formalize the seed problem into its symbolic version. (2) Mutate the symbolic problem to create new variants. (3) Translate the variants in symbolic form back to the natural language version. Additionally, we prompt GPT-4 to generate reasoning paths, which are verified by symbolic solvers, as part of the supervision.
  • Figure 2: The performance of our proposed mutation mechanism. The first panel illustrates that generated problems with higher difficulty levels require more reasoning steps from GPT-4. The second panel shows that gradually incorporating more difficult problems consistently improves the LLM's reasoning capability.
  • Figure 3: Performance curves of the LLaMA-2-7B models fine-tuned on various scales of datasets. The two datasets are generated by our approach and MetaMath (MMQA). The performance can be consistently enhanced by increasing the amount of data generated using the proposed framework.
  • Figure 4: The diversity gain across all difficulty levels. The results indicate that, as the data budget increases, the diversity gain of the Mix version keeps increasing and ends up the highest among the alternatives.